2025-05-20-12-13

Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions

Abstract

arXiv:2505.11614v1 Announce Type: new Abstract: A central goal of cognitive modeling is to develop models that not only predict human behavior but also provide insight into the underlying cognitive mechanisms. While neural network models trained on large-scale behavioral data often achieve strong predictive performance, they typically fall short in offering interpretable explanations of the cognitive processes they capture. In this work, we explore the potential of pretrained large language models (LLMs) to serve as dual-purpose cognitive models--capable of both accurate prediction and interpretable explanation in natural language. Specifically, we employ reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces for explaining human risky choices. Our findings demonstrate that this approach produces high-quality explanations alongside strong quantitative predictions of human decisions.

摘要

认知建模的核心目标是开发不仅能预测人类行为、还能揭示潜在认知机制的模型。尽管基于大规模行为数据训练的神经网络模型通常具有强大的预测性能，但它们往往无法对所捕捉的认知过程提供可解释的说明。本研究探索了预训练大语言模型（LLMs）作为双重用途认知模型的潜力——既能实现准确预测，又能以自然语言提供可解释的说明。具体而言，我们采用基于结果奖励的强化学习来引导LLMs生成显式推理轨迹，用以解释人类风险决策。研究结果表明，该方法在提供人类决策强大量化预测的同时，还能产生高质量的解释性说明。

PeerGuard: Defending Multi-Agent Systems Against Backdoor Attacks Through Mutual Reasoning

Abstract

arXiv:2505.11642v1 Announce Type: new Abstract: Multi-agent systems leverage advanced AI models as autonomous agents that interact, cooperate, or compete to complete complex tasks across applications such as robotics and traffic management. Despite their growing importance, safety in multi-agent systems remains largely underexplored, with most research focusing on single AI models rather than interacting agents. This work investigates backdoor vulnerabilities in multi-agent systems and proposes a defense mechanism based on agent interactions. By leveraging reasoning abilities, each agent evaluates responses from others to detect illogical reasoning processes, which indicate poisoned agents. Experiments on LLM-based multi-agent systems, including ChatGPT series and Llama 3, demonstrate the effectiveness of the proposed method, achieving high accuracy in identifying poisoned agents while minimizing false positives on clean agents. We believe this work provides insights into multi-agent system safety and contributes to the development of robust, trustworthy AI interactions.

摘要

多智能体系统利用先进的人工智能模型作为自主智能体，通过交互、合作或竞争完成机器人学和交通管理等应用中的复杂任务。尽管其重要性日益凸显，多智能体系统的安全性研究仍严重不足，现有工作多集中于单一AI模型而非交互智能体。本研究探讨多智能体系统中的后门漏洞，并提出基于智能体交互的防御机制。通过运用推理能力，每个智能体可评估其他智能体的响应以检测异常推理过程，从而识别被污染智能体。在基于大语言模型的多智能体系统（包括ChatGPT系列和Llama 3）上的实验表明，该方法能有效识别被污染智能体且对正常智能体误判率极低。我们相信这项工作为多智能体系统安全研究提供了新视角，有助于发展鲁棒、可信的人工智能交互。

FLOW-BENCH: Towards Conversational Generation of Enterprise Workflows

Abstract

arXiv:2505.11646v1 Announce Type: new Abstract: Business process automation (BPA) that leverages Large Language Models (LLMs) to convert natural language (NL) instructions into structured business process artifacts is becoming a hot research topic. This paper makes two technical contributions -- (i) FLOW-BENCH, a high-quality dataset of paired natural language instructions and structured business process definitions to evaluate NL-based BPA tools and support burgeoning research in this area, and (ii) FLOW-GEN, our approach to utilize LLMs to translate natural language into an intermediate representation with Python syntax that facilitates final conversion into widely adopted business process definition languages, such as BPMN and DMN. We bootstrap FLOW-BENCH by demonstrating how it can be used to evaluate the components of FLOW-GEN across eight LLMs of varying sizes. We hope that FLOW-GEN and FLOW-BENCH catalyze further research in BPA, making it more accessible to novice and expert users.

摘要

利用大型语言模型（LLMs）将自然语言（NL）指令转化为结构化业务流程制品的业务流程自动化（BPA）正成为研究热点。本文提出两项技术贡献：（i）FLOW-BENCH——一个高质量的自然语言指令与结构化业务流程定义配对数据集，用于评估基于NL的BPA工具，并支持该领域新兴研究；（ii）FLOW-GEN——我们提出的方法，通过LLMs将自然语言转换为具有Python语法的中间表示，从而促进最终转化为广泛采用的业务流程定义语言（如BPMN和DMN）。我们通过展示如何利用FLOW-BENCH评估FLOW-GEN在八个不同规模LLMs中的组件性能，实现了该数据集的初始构建。期望FLOW-GEN和FLOW-BENCH能推动BPA领域的进一步研究，使其更易于新手和专家用户使用。

Probing the Vulnerability of Large Language Models to Polysemantic Interventions

Abstract

arXiv:2505.11611v1 Announce Type: new Abstract: Polysemanticity -- where individual neurons encode multiple unrelated features -- is a well-known characteristic of large neural networks and remains a central challenge in the interpretability of language models. At the same time, its implications for model safety are also poorly understood. Leveraging recent advances in sparse autoencoders, we investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small) and evaluate their vulnerability to targeted, covert interventions at the prompt, feature, token, and neuron levels. Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models (LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct). These findings suggest not only the generalizability of the interventions but also point to a stable and transferable polysemantic structure that could potentially persist across architectures and training regimes.

摘要

多义性——即单个神经元编码多个无关特征的现象——是大型神经网络的一个显著特征，也始终是语言模型可解释性研究的核心挑战。与此同时，人们对其在模型安全性方面的影响也知之甚少。借助稀疏自编码器的最新进展，我们研究了两个小型模型（Pythia-70M和GPT-2-Small）的多义性结构，并评估了它们在提示、特征、标记和神经元层面上遭受针对性隐蔽干预的脆弱性。分析揭示了两模型共有的稳定多义性拓扑结构。引人注目的是，我们证明这种结构可被用于对两个更大的黑盒指令微调模型（LLaMA3.1-8B-Instruct和Gemma-2-9B-Instruct）实施有效干预。这些发现不仅表明干预措施具有普适性，更指向了一种稳定且可迁移的多义性结构——这种结构可能在不同架构和训练方案中持续存在。

Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

Abstract

arXiv:2505.11730v1 Announce Type: new Abstract: Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification, and make the first attempt toward systematically investigating the impact of verification granularity, that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve the compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.

摘要

测试时缩放（TTS）技术已被证明能有效增强大语言模型（LLMs）的推理能力。验证环节在TTS中起着关键作用，其质量与计算成本同时影响着（1）推理性能与（2）计算效率。本研究突破传统验证范式，首次系统探究验证粒度（即验证器在生成过程中被调用的频率，而非仅验证最终输出或单步生成）的影响机制。为此，我们提出可变粒度搜索算法（VG-Search），该统一算法通过可调粒度参数g泛化了束搜索与N最佳采样。在不同计算预算、生成器-验证器配置及任务属性下的实验表明：动态选择g能提升计算效率与缩放性能。基于此，我们提出自适应VG-Search策略，相比束搜索和N最佳采样分别实现最高3.1%和3.6%的准确率提升，同时减少52%以上的浮点运算量。相关代码将开源以支持后续研究。
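The role of the granularity parameter g can be illustrated with a toy search loop. This is only a sketch under assumptions: the abstract does not specify VG-Search's internals, so the function name `vg_search`, the callbacks `extend` and `verify`, and the parameters `beam_width` and `branch` are hypothetical stand-ins, not the authors' implementation. The verifier is called once every g generation steps; g = 1 behaves like step-level beam search, while g equal to the full solution length scores only complete samples, as in Best-of-N.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # a partial solution, modelled here as a tuple of step tokens


def vg_search(
    extend: Callable[[State, int], State],  # hypothetical generator: one step per call
    verify: Callable[[State], float],       # hypothetical verifier: higher is better
    total_steps: int,
    g: int,
    beam_width: int,
    branch: int,
) -> Tuple[State, int]:
    """Toy search that invokes the verifier once every g generation steps.

    Returns the best final state and the number of verifier calls made.
    """
    beam: List[State] = [()]
    verifier_calls = 0
    done = 0
    while done < total_steps:
        chunk = min(g, total_steps - done)
        candidates: List[State] = []
        for state in beam:
            for b in range(branch):
                cand = state
                for _ in range(chunk):  # generate g steps without verification
                    cand = extend(cand, b)
                candidates.append(cand)
        scored = []
        for cand in candidates:  # one verifier call per candidate per round
            scored.append((verify(cand), cand))
            verifier_calls += 1
        scored.sort(key=lambda sc: sc[0], reverse=True)
        beam = [cand for _, cand in scored[:beam_width]]
        done += chunk
    return beam[0], verifier_calls


if __name__ == "__main__":
    toy_extend = lambda s, b: s + (b,)
    toy_verify = lambda s: float(sum(s))
    print(vg_search(toy_extend, toy_verify, total_steps=4, g=1, beam_width=2, branch=2))
    print(vg_search(toy_extend, toy_verify, total_steps=4, g=4, beam_width=2, branch=2))
```

In this toy setting, a larger g trades verifier calls for coarser pruning, which is the compute/accuracy trade-off the abstract describes.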

DMN-Guided Prompting: A Low-Code Framework for Controlling LLM Behavior

Abstract

arXiv:2505.11701v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown considerable potential in automating decision logic within knowledge-intensive processes. However, their effectiveness largely depends on the strategy and quality of prompting. Since decision logic is typically embedded in prompts, it becomes challenging for end users to modify or refine it. Decision Model and Notation (DMN) offers a standardized graphical approach for defining decision logic in a structured, user-friendly manner. This paper introduces a DMN-guided prompting framework that breaks down complex decision logic into smaller, manageable components, guiding LLMs through structured decision pathways. We implemented the framework in a graduate-level course where students submitted assignments. The assignments and DMN models representing feedback instructions served as inputs to our framework. The instructor evaluated the generated feedback and labeled it for performance assessment. Our approach demonstrated promising results, outperforming chain-of-thought (CoT) prompting. Students also responded positively to the generated feedback, reporting high levels of perceived usefulness in a survey based on the Technology Acceptance Model.

摘要

大语言模型（LLMs）在自动化知识密集型流程中的决策逻辑方面展现出显著潜力，但其效能很大程度上依赖于提示策略与质量。由于决策逻辑通常嵌入于提示中，终端用户难以对其进行修改或优化。决策模型与标记法（DMN）提供了一种标准化的图形化方法，能以结构化且用户友好的方式定义决策逻辑。本文提出一种DMN引导的提示框架，将复杂决策逻辑分解为更小、更易管理的组件，通过结构化决策路径引导LLMs。我们在研究生课程中实施该框架，学生提交作业后，作业内容和代表反馈指令的DMN模型作为框架输入。授课教师对生成的反馈进行评估并标注性能指标。该方法表现出优于思维链（CoT）提示的效果，且学生基于技术接受模型的调查反馈显示，他们对生成反馈的感知有用性评价较高。

LLM Agents Are Hypersensitive to Nudges

Abstract

arXiv:2505.11584v1 Announce Type: new Abstract: LLMs are being set loose in complex, real-world environments involving sequential decision-making and tool use. Often, this involves making choices on behalf of human users. However, not much is known about the distribution of such choices, and how susceptible they are to different choice architectures. We perform a case study with a few such LLM models on a multi-attribute tabular decision-making problem, under canonical nudges such as the default option, suggestions, and information highlighting, as well as additional prompting strategies. We show that, despite superficial similarities to human choice distributions, such models differ in subtle but important ways. First, they show much higher susceptibility to the nudges. Second, they diverge in points earned, being affected by factors like the idiosyncrasy of available prizes. Third, they diverge in information acquisition strategies: e.g. incurring substantial cost to reveal too much information, or selecting without revealing any. Moreover, we show that simple prompt strategies like zero-shot chain of thought (CoT) can shift the choice distribution, and few-shot prompting with human data can induce greater alignment. Yet, none of these methods resolve the sensitivity of these models to nudges. Finally, we show how optimal nudges optimized with a human resource-rational model can similarly increase LLM performance for some models. All these findings suggest that behavioral tests are needed before deploying models as agents or assistants acting on behalf of users in complex environments.

摘要

大型语言模型（LLMs）正被部署于涉及序列决策和工具使用的复杂现实环境中。这类场景通常需要模型代表人类用户做出选择。然而，目前对此类选择的分布特征及其对不同选择架构的敏感性仍缺乏深入研究。我们针对若干LLM模型开展案例研究，通过多属性表格决策任务，考察默认选项、建议提示、信息突显等经典助推手段及额外提示策略的影响。研究发现，尽管这些模型的表面选择分布与人类存在相似性，但在细微而关键的维度上存在差异：首先，它们对助推手段表现出更高的敏感性；其次，在收益获取上存在偏离，易受奖品特异性等因素影响；第三，其信息获取策略显著不同，例如可能付出高昂成本获取过量信息，或在未揭示任何信息时直接选择。实验还表明，零样本思维链（CoT）等简单提示策略能改变选择分布，而基于人类数据的少样本提示可提升对齐性，但这些方法均未能消除模型对助推的敏感性。最后，我们证明基于人类资源理性模型优化的最佳助推策略同样能提升部分LLM的性能。这些发现共同表明，在将模型作为用户代理部署于复杂环境前，必须进行行为测试。

Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing

Abstract

arXiv:2505.11743v1 Announce Type: new Abstract: With the rapid development of cloud computing systems and the increasing complexity of their infrastructure, intelligent mechanisms to detect and mitigate failures in real time are becoming increasingly important. Traditional methods of failure detection often struggle to cope with the scale and dynamics of modern cloud environments. In this study, we propose a novel AI framework based on large language models (LLMs) for intelligent fault detection and self-healing in cloud systems. The model combines existing machine learning fault detection algorithms with the LLM's natural language understanding capabilities to process and parse system logs, error reports, and real-time data streams using semantic context. The method adopts a multi-level architecture that combines supervised learning for fault classification with unsupervised learning for anomaly detection, so that the system can predict potential failures before they occur and automatically trigger the self-healing mechanism. Experimental results show that the proposed model significantly outperforms traditional fault detection systems in terms of fault detection accuracy, system downtime reduction, and recovery speed.

摘要

随着云计算系统的快速发展和基础设施日益复杂，实时检测与缓解故障的智能机制变得愈发重要。传统故障检测方法往往难以应对现代云环境的规模与动态性。本研究提出一种基于大语言模型（LLM）的新型人工智能框架，用于实现云系统智能故障检测与自愈机制。该模型将现有机器学习故障检测算法与LLM的自然语言理解能力相结合，通过语义上下文处理解析系统日志、错误报告和实时数据流。该方法采用多层架构，结合监督学习进行故障分类和无监督学习进行异常检测，使系统能够在潜在故障发生前进行预测并自动触发自愈机制。实验结果表明，所提模型在故障检测准确率、系统停机时间缩减及恢复速度方面显著优于传统故障检测系统。

Heart2Mind: Human-Centered Contestable Psychiatric Disorder Diagnosis System using Wearable ECG Monitors

Abstract

arXiv:2505.11612v1 Announce Type: new Abstract: Psychiatric disorders affect millions globally, yet their diagnosis faces significant challenges in clinical practice due to subjective assessments and accessibility concerns, leading to potential delays in treatment. To help address this issue, we present Heart2Mind, a human-centered contestable psychiatric disorder diagnosis system using wearable electrocardiogram (ECG) monitors. Our approach leverages cardiac biomarkers, particularly heart rate variability (HRV) and R-R intervals (RRI) time series, as objective indicators of autonomic dysfunction in psychiatric conditions. The system comprises three key components: (1) a Cardiac Monitoring Interface (CMI) for real-time data acquisition from Polar H9/H10 devices; (2) a Multi-Scale Temporal-Frequency Transformer (MSTFT) that processes RRI time series through integrated time-frequency domain analysis; (3) a Contestable Diagnosis Interface (CDI) combining Self-Adversarial Explanations (SAEs) with contestable Large Language Models (LLMs). Our MSTFT achieves 91.7% accuracy on the HRV-ACC dataset using leave-one-out cross-validation, outperforming state-of-the-art methods. SAEs successfully detect inconsistencies in model predictions by comparing attention-based and gradient-based explanations, while LLMs enable clinicians to validate correct predictions and contest erroneous ones. This work demonstrates the feasibility of combining wearable technology with Explainable Artificial Intelligence (XAI) and contestable LLMs to create a transparent, contestable system for psychiatric diagnosis that maintains clinical oversight while leveraging advanced AI capabilities. Our implementation is publicly available at: https://github.com/Analytics-Everywhere-Lab/heart2mind.

摘要

精神障碍影响着全球数百万人，但由于临床实践中主观评估和可及性问题，其诊断面临重大挑战，可能导致治疗延迟。为应对这一问题，我们提出Heart2Mind——一种基于可穿戴心电图（ECG）监测设备、以人为中心且具有可争议性的精神障碍诊断系统。该方法利用心脏生物标志物（特别是心率变异性HRV和R-R间期RRI时间序列）作为精神疾病自主神经功能障碍的客观指标。系统包含三个核心组件：(1) 用于从Polar H9/H10设备实时获取数据的心脏监测接口（CMI）；(2) 通过时频域联合分析处理RRI时间序列的多尺度时频变换器（MSTFT）；(3) 将自对抗解释（SAEs）与可争议大语言模型（LLMs）相结合的争议诊断接口（CDI）。我们的MSTFT在HRV-ACC数据集上采用留一法交叉验证达到91.7%准确率，优于现有最优方法。SAEs通过比较基于注意力和梯度的解释成功检测模型预测不一致性，而LLMs使临床医生能验证正确预测并质疑错误结果。这项工作证明了将可穿戴技术与可解释人工智能（XAI）及可争议LLMs相结合，构建透明、可争议精神障碍诊断系统的可行性，该系统在利用先进AI能力的同时保持临床监督。实现代码已公开于：https://github.com/Analytics-Everywhere-Lab/heart2mind。
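For readers unfamiliar with the cardiac features mentioned above, the following sketch computes two standard time-domain HRV statistics, SDNN and RMSSD, from an R-R interval series. It is an illustration only: the abstract does not state which HRV features Heart2Mind uses, and the function names and the sample values are assumptions made for this example.

```python
import math
from typing import List


def sdnn(rr_ms: List[float]) -> float:
    """Sample standard deviation of the R-R intervals, in milliseconds."""
    mean_rr = sum(rr_ms) / len(rr_ms)
    variance = sum((x - mean_rr) ** 2 for x in rr_ms) / (len(rr_ms) - 1)
    return math.sqrt(variance)


def rmssd(rr_ms: List[float]) -> float:
    """Root mean square of successive R-R interval differences, in milliseconds."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


if __name__ == "__main__":
    # Made-up R-R intervals in milliseconds, for illustration only.
    rr = [800.0, 810.0, 790.0, 805.0]
    print(f"SDNN  = {sdnn(rr):.2f} ms")
    print(f"RMSSD = {rmssd(rr):.2f} ms")
```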

OMAC: A Broad Optimization Framework for LLM-Based Multi-Agent Collaboration

Abstract

arXiv:2505.11765v1 Announce Type: new Abstract: Agents powered by advanced large language models (LLMs) have demonstrated impressive capabilities across diverse complex applications. Recently, Multi-Agent Systems (MAS), wherein multiple agents collaborate and communicate with each other, have exhibited enhanced capabilities in complex tasks, such as high-quality code generation and arithmetic reasoning. However, the development of such systems often relies on handcrafted methods, and the literature on systematic design and optimization of LLM-based MAS remains limited. In this work, we introduce OMAC, a general framework designed for holistic optimization of LLM-based MAS. Specifically, we identify five key optimization dimensions for MAS, encompassing both agent functionality and collaboration structure. Building upon these dimensions, we first propose a general algorithm, utilizing two actors termed the Semantic Initializer and the Contrastive Comparator, to optimize any single dimension. Then, we present an algorithm for joint optimization across multiple dimensions. Extensive experiments demonstrate the superior performance of OMAC on code generation, arithmetic reasoning, and general reasoning tasks against state-of-the-art approaches.

摘要

基于先进大语言模型（LLM）的智能体已在多样化的复杂应用中展现出卓越能力。近期，多智能体系统（MAS）通过智能体间的协作与通信，在代码生成和算术推理等复杂任务中表现出增强性能。然而，此类系统的开发通常依赖手工方法，关于基于LLM的MAS系统化设计与优化的研究仍较为有限。本研究提出OMAC框架，旨在实现基于LLM的MAS整体优化。具体而言，我们识别出MAS的五个关键优化维度，涵盖智能体功能与协作结构。基于这些维度，首先提出通用算法——利用语义初始化器和对比比较器两个执行组件——以优化单一维度；继而提出跨维度联合优化算法。大量实验表明，OMAC在代码生成、算术推理和通用推理任务上的性能显著优于现有最优方法。

REMOR: Automated Peer Review Generation with LLM Reasoning and Multi-Objective Reinforcement Learning

Abstract

arXiv:2505.11718v1 Announce Type: new Abstract: AI-based peer review systems tend to produce shallow and overpraising suggestions compared to human feedback. Here, we evaluate how well a reasoning LLM trained with multi-objective reinforcement learning (REMOR) can overcome these limitations. We start by designing a multi-aspect reward function that aligns with human evaluation of reviews. The aspects are related to the review itself (e.g., criticisms, novelty) and the relationship between the review and the manuscript (i.e., relevance). First, we perform supervised fine-tuning of DeepSeek-R1-Distill-Qwen-7B using LoRA on PeerRT, a new dataset of high-quality top AI conference reviews enriched with reasoning traces. We then apply Group Relative Policy Optimization (GRPO) to train two models: REMOR-H (with the human-aligned reward) and REMOR-U (with a uniform reward). Interestingly, the human-aligned reward penalizes aspects typically associated with strong reviews, leading REMOR-U to produce qualitatively more substantive feedback. Our results show that REMOR-U and REMOR-H achieve more than twice the average rewards of human reviews, non-reasoning state-of-the-art agentic multi-modal AI review systems, and general commercial LLM baselines. We found that while the best AI and human reviews are comparable in quality, REMOR avoids the long tail of low-quality human reviews. We discuss how reasoning is key to achieving these improvements and release the Human-aligned Peer Review Reward (HPRR) function, the Peer Review Reasoning-enriched Traces (PeerRT) dataset, and the REMOR models, which we believe can help spur progress in the area.

摘要

基于AI的同行评审系统往往会产生比人类反馈更肤浅且过度褒扬的建议。本研究评估了采用多目标强化学习(REMOR)训练的逻辑推理大语言模型如何克服这些局限。我们首先设计了一个与人类评审评价标准一致的多维度奖励函数，这些维度涉及评审本身特性(如批评性、新颖性)以及评审与稿件间关联性(即相关性)。研究首先在PeerRT数据集上使用LoRA方法对DeepSeek-R1-Distill-Qwen-7B模型进行监督微调，该数据集是富含推理痕迹的顶级AI会议高质量评审新数据集。随后应用组相对策略优化(GRPO)训练了两个模型：REMOR-H(采用人类对齐奖励)和REMOR-U(采用均匀奖励)。有趣的是，人类对齐奖励会惩罚通常与优质评审相关的维度，这使得REMOR-U能产生质量上更具实质性的反馈。结果表明，REMOR-U和REMOR-H获得的平均奖励超过人类评审、非推理型最先进多模态AI评审系统及通用商业大语言模型基线两倍以上。研究发现，虽然最佳AI评审与人类评审质量相当，但REMOR避免了人类评审中常见的低质量长尾现象。我们论证了逻辑推理是实现这些改进的关键，并开源了人类对齐同行评审奖励函数(HPRR)、富含推理的同行评审痕迹数据集(PeerRT)及REMOR模型，这些资源有望推动该领域发展。
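The combination of a multi-aspect reward with GRPO can be sketched as follows. This is a minimal illustration under assumptions: the aspect names are taken from the examples in the abstract, but the weights, the function names, and the use of the population standard deviation are hypothetical choices, not the HPRR function or the authors' training code. The group-relative step normalizes each sampled review's reward against the other reviews sampled for the same manuscript.

```python
from statistics import mean, pstdev
from typing import Dict, List


def aggregate_reward(aspects: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-aspect scores (weights here are hypothetical)."""
    return sum(weights[name] * score for name, score in aspects.items())


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Uniform weighting, in the spirit of REMOR-U; a human-aligned variant
    # would replace these with weights fitted to human judgments.
    uniform = {"criticism": 1 / 3, "novelty": 1 / 3, "relevance": 1 / 3}
    group = [
        {"criticism": 0.8, "novelty": 0.2, "relevance": 0.5},
        {"criticism": 0.1, "novelty": 0.1, "relevance": 0.4},
        {"criticism": 0.9, "novelty": 0.6, "relevance": 0.9},
    ]
    rewards = [aggregate_reward(a, uniform) for a in group]
    print(group_relative_advantages(rewards))
```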

Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges

Abstract

arXiv:2505.11618v1 Announce Type: new Abstract: Spatiotemporal reasoning plays a key role in Cyber-Physical Systems (CPS). Despite advances in Large Language Models (LLMs) and Large Reasoning Models (LRMs), their capacity to reason about complex spatiotemporal signals remains underexplored. This paper proposes a hierarchical SpatioTemporal reAsoning benchmaRK, STARK, to systematically evaluate LLMs across three levels of reasoning complexity: state estimation (e.g., predicting field variables, localizing and tracking events in space and time), spatiotemporal reasoning over states (e.g., inferring spatial-temporal relationships), and world-knowledge-aware reasoning that integrates contextual and domain knowledge (e.g., intent prediction, landmark-aware navigation). We curate 26 distinct spatiotemporal tasks with diverse sensor modalities, comprising 14,552 challenges where models answer directly or via a Python code interpreter. Evaluating 3 LRMs and 8 LLMs, we find that LLMs achieve limited success in tasks requiring geometric reasoning (e.g., multilateration or triangulation), particularly as complexity increases. Surprisingly, LRMs show robust performance across tasks with various levels of difficulty, often competing with or surpassing traditional first-principle-based methods. Our results show that in reasoning tasks requiring world knowledge, the performance gap between LLMs and LRMs narrows, with some LLMs even surpassing LRMs. However, the LRM o3 model continues to achieve leading performance across all evaluated tasks, a result attributed primarily to the larger size of the reasoning models. STARK motivates future innovations in model architectures and reasoning paradigms for intelligent CPS by providing a structured framework to identify limitations in the spatiotemporal reasoning of LLMs and LRMs.

摘要

时空推理在信息物理系统（CPS）中具有关键作用。尽管大语言模型（LLM）和大推理模型（LRM）取得了进展，但其对复杂时空信号的推理能力仍待深入探索。本文提出分层时空推理基准STARK，系统评估LLM在三个推理复杂度层级的表现：状态估计（如预测场变量、时空事件定位与追踪）、基于状态的时空推理（如推断时空关系）以及融合上下文与领域知识的世界知识感知推理（如意图预测、地标感知导航）。我们构建了涵盖26种传感器模态的时空任务，包含14,552个挑战项，模型可通过直接回答或Python代码解释器完成。通过评估3个LRM和8个LLM，发现LLM在需要几何推理的任务（如多边测量或三角定位）中成功率有限，且随复杂度增加表现显著下降。值得注意的是，LRM在不同难度任务中均表现出鲁棒性，常优于或媲美传统基于第一性原理的方法。研究表明，在需要世界知识的推理任务中，LLM与LRM的性能差距缩小，部分LLM甚至超越LRM。但LRM o3模型在所有评估任务中持续保持领先优势，这主要归因于其更大的模型规模。STARK通过结构化框架揭示了LLM和LRM在时空推理中的局限性，为智能CPS的模型架构与推理范式创新提供了方向。

Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission

Abstract

arXiv:2505.11788v1 Announce Type: new Abstract: To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between the SLM's uncertainty and the LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206× higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.

摘要

为支持基于语言的应用程序利用分散异构计算资源，混合语言模型（HLM）提供了一种前景广阔的架构：设备端的小型语言模型（SLM）生成候选标记，由远程大型语言模型（LLM）进行验证和校正。然而原始HLM存在显著通信开销，因为LLM要求SLM为每个标记上传完整的词汇表概率分布。此外，当LLM验证极可能被接受的标记时，通信与计算资源均被浪费。为克服这些局限，我们提出通信高效且不确定性感知的HLM（CU-HLM）。在CU-HLM中，SLM仅在其输出不确定性较高时传输截断的词汇表分布。通过发现SLM不确定性与LLM拒绝概率间的强相关性，我们验证了这种机会式传输的可行性。进一步，我们理论推导出最优不确定性阈值与最优词汇表截断策略。仿真结果表明：相较于标准HLM，CU-HLM通过跳过74.8%的传输并实现97.4%的词汇表压缩，使标记吞吐量提升最高达206倍，同时保持97.4%的准确率。
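The opportunistic, compressed transmission described above can be sketched in a few lines. This is an illustration under assumptions: the abstract does not say how uncertainty is measured or how the distribution is truncated, so Shannon entropy, top-k truncation with renormalization, and the names `entropy`, `truncate`, and `maybe_transmit` are hypothetical choices, and the threshold and k below are arbitrary rather than the derived optimal values.

```python
import math
from typing import Dict, Optional


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def truncate(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most likely tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    mass = sum(p for _, p in top)
    return {tok: p / mass for tok, p in top}


def maybe_transmit(
    probs: Dict[str, float], threshold: float, k: int
) -> Optional[Dict[str, float]]:
    """Return a truncated distribution to upload, or None to skip transmission."""
    if entropy(probs) <= threshold:
        return None  # SLM is confident: accept the draft token locally
    return truncate(probs, k)


if __name__ == "__main__":
    confident = {"a": 0.97, "b": 0.02, "c": 0.01}
    uncertain = {"a": 0.40, "b": 0.35, "c": 0.15, "d": 0.10}
    print(maybe_transmit(confident, threshold=0.5, k=2))
    print(maybe_transmit(uncertain, threshold=0.5, k=2))
```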

ChatHTN: Interleaving Approximate (LLM) and Symbolic HTN Planning

Abstract

arXiv:2505.11814v1 Announce Type: new Abstract: We introduce ChatHTN, a Hierarchical Task Network (HTN) planner that combines symbolic HTN planning techniques with queries to ChatGPT to approximate solutions in the form of task decompositions. The resulting hierarchies interleave task decompositions generated by symbolic HTN planning with those generated by ChatGPT. Despite the approximate nature of the results generated by ChatGPT, ChatHTN is provably sound; any plan it generates correctly achieves the input tasks. We demonstrate this property with an open-source implementation of our system.

摘要

我们提出ChatHTN——一种结合符号化分层任务网络（HTN）规划技术与ChatGPT查询的分层任务网络规划器，其通过任务分解形式生成近似解。该体系结构交替整合符号化HTN规划生成的任务分解与ChatGPT产生的分解方案。尽管ChatGPT生成的结果具有近似性，但ChatHTN具有可证明的可靠性：其生成的任何计划都能正确完成输入任务。我们通过系统的开源实现验证了这一特性。

On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study

Abstract

arXiv:2505.11839v1 Announce Type: new Abstract: Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models (LLMs). By generating and analyzing counterfactual scenarios, researchers can assess the adaptability and reliability of model decision-making. Although prior work has shown that LLMs often struggle with counterfactual reasoning, it remains unclear which factors most significantly impede their performance across different tasks and modalities. In this paper, we propose a decompositional strategy that breaks down the counterfactual generation from causality construction to the reasoning over counterfactual interventions. To support decompositional analysis, we investigate 11 datasets spanning diverse tasks, including natural language understanding, mathematics, programming, and vision-language tasks. Through extensive evaluations, we characterize LLM behavior across each decompositional stage and identify how modality type and intermediate reasoning influence performance. By establishing a structured framework for analyzing counterfactual reasoning, this work contributes to the development of more reliable LLM-based reasoning systems and informs future elicitation strategies.

摘要

反事实推理已成为增强大语言模型（LLMs）推理能力的关键技术。通过生成和分析反事实场景，研究者能够评估模型决策的适应性与可靠性。尽管已有研究表明LLMs在反事实推理中常表现不佳，但何种因素对不同任务和模态下的性能阻碍最大仍不明确。本文提出一种分解策略，将反事实生成过程从因果构建拆解至反事实干预的推理阶段。为支持分解分析，我们研究了涵盖自然语言理解、数学、编程及视觉语言任务等11个数据集。通过大规模评估，我们刻画了LLMs在各分解阶段的行为特征，并揭示了模态类型与中间推理如何影响性能。本研究通过建立反事实推理的结构化分析框架，为开发更可靠的基于LLM的推理系统提供了基础，同时为未来能力激发策略提供了理论依据。

Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling

Abstract

arXiv:2505.11792v1 Announce Type: new Abstract: Optimization modeling is fundamental to decision-making across diverse domains. Despite progress in automating optimization formulation from natural language descriptions, Large Language Models (LLMs) often struggle to generate formally correct and usable models due to hallucinations, posing a challenge for reliable automation. Inspired by the success of Reinforcement Learning (RL) in enhancing Large Reasoning Models, we present Solver-Informed Reinforcement Learning (SIRL). This novel framework leverages external optimization solvers as verifiable reward mechanisms to significantly improve the authenticity of LLMs for optimization modeling. Acting as precise verifiers, these solvers automatically assess the executable code and the instance-level mathematical model represented by the associated LP file, yielding precise and comprehensive feedback signals -- including syntax, feasibility, and solution quality -- that directly inform the RL process. This automated verification process, powered by classic optimization solvers, also underpins our instance-enhanced self-consistency method to synthesize high-quality training data. Extensive experiments on diverse public benchmarks demonstrate that SIRL achieves state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models.

摘要

优化建模是跨领域决策制定的基础。尽管从自然语言描述自动生成优化模型的研究已取得进展，但大型语言模型（LLMs）常因幻觉问题难以生成形式正确且可用的模型，这为可靠自动化带来了挑战。受强化学习（RL）在增强大型推理模型方面成功的启发，我们提出求解器知情强化学习（SIRL）框架。该创新方法利用外部优化求解器作为可验证的奖励机制，显著提升LLMs在优化建模中的真实性。这些求解器作为精确验证器，能自动评估可执行代码及关联LP文件所表示的实例级数学模型，产生精确全面的反馈信号——包括语法、可行性和解质量等直接指导RL过程的信息。这种由经典优化求解器驱动的自动化验证过程，还支撑了我们提出的实例增强自洽方法，用于合成高质量训练数据。在多样化公共基准测试上的大量实验表明，SIRL实现了最先进的性能，在生成准确且可执行的优化模型方面显著优于现有方法。
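One way to picture a solver-derived reward is as a staged check over the three signals the abstract names: syntax, feasibility, and solution quality. The sketch below is hypothetical throughout: `SolverReport`, `solver_reward`, the stage values 0.25/0.25/0.5, and the tolerance are assumptions for illustration, and no real solver is invoked; the abstract does not describe how SIRL actually scores these signals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SolverReport:
    """Hypothetical summary of a solver run on one generated model."""
    executed: bool              # generated code ran and produced a model
    feasible: bool              # solver found a feasible solution
    objective: Optional[float]  # objective value, if one was returned


def solver_reward(
    report: SolverReport, reference_objective: float, rel_tol: float = 1e-6
) -> float:
    """Staged reward: execution, then feasibility, then objective match."""
    if not report.executed:
        return 0.0
    reward = 0.25  # syntax/execution stage passed (value is an assumption)
    if not report.feasible or report.objective is None:
        return reward
    reward += 0.25  # feasibility stage passed
    gap = abs(report.objective - reference_objective)
    if gap <= rel_tol * max(1.0, abs(reference_objective)):
        reward += 0.5  # objective matches the reference within tolerance
    return reward


if __name__ == "__main__":
    print(solver_reward(SolverReport(True, True, 10.0), reference_objective=10.0))
    print(solver_reward(SolverReport(True, False, None), reference_objective=10.0))
```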

ToLeaP: Rethinking Development of Tool Learning with Large Language Models

Abstract

arXiv:2505.11833v1 Announce Type: new Abstract: Tool learning, which enables large language models (LLMs) to utilize external tools effectively, has garnered increasing attention for its potential to revolutionize productivity across industries. Despite rapid development in tool learning, key challenges and opportunities remain understudied, limiting deeper insights and future advancements. In this paper, we investigate the tool learning ability of 41 prevalent LLMs by reproducing 33 benchmarks and enabling one-click evaluation for seven of them, forming a Tool Learning Platform named ToLeaP. We also collect 21 out of 33 potential training datasets to facilitate future exploration. After analyzing over 3,000 bad cases of 41 LLMs based on ToLeaP, we identify four main critical challenges: (1) benchmark limitations induce both the neglect and lack of (2) autonomous learning, (3) generalization, and (4) long-horizon task-solving capabilities of LLMs. To aid future advancements, we take a step further toward exploring potential directions, namely (1) real-world benchmark construction, (2) compatibility-aware autonomous learning, (3) rationale learning by thinking, and (4) identifying and recalling key clues. The preliminary experiments demonstrate their effectiveness, highlighting the need for further research and exploration.

摘要

工具学习通过使大语言模型（LLMs）能够有效利用外部工具，因其在各行业革新生产力的潜力而受到越来越多的关注。尽管工具学习发展迅速，但关键挑战与机遇仍未得到充分研究，这限制了对该领域更深层次见解和未来进展的探索。本文通过复现33个基准测试并对其中7个实现一键评估（构建名为ToLeaP的工具学习平台），调查了41个主流LLMs的工具学习能力。我们还收集了33个潜在训练数据集中的21个，以促进未来研究。基于ToLeaP平台分析41个LLMs的3000余个失败案例后，我们识别出四大核心挑战：(1) 基准测试的局限性导致LLMs在(2)自主学习、(3)泛化能力及(4)长程任务解决能力方面存在缺失与不足。为推进未来发展，我们进一步探索了四个潜在方向：(1) 真实场景基准构建、(2) 兼容性感知的自主学习、(3) 通过思维推演进行原理学习、(4) 关键线索识别与召回。初步实验验证了这些方向的有效性，凸显了进一步研究与探索的必要性。

Abstract

arXiv:2505.11861v1 Announce Type: new Abstract: Human preference plays a crucial role in the refinement of large language models (LLMs). However, collecting human preference feedback is costly, and most existing datasets neglect the correlation between personalization and preferences. To address this issue, we introduce Fair-PP, a synthetic dataset of personalized preferences targeting social equity, derived from real-world social survey data, which includes 28 social groups, 98 equity topics, and 5 personal preference dimensions. Leveraging GPT-4o-mini, we engage in role-playing based on seven representative persona portrayals guided by existing social survey data, yielding a total of 238,623 preference records. Through Fair-PP, we also contribute (i) an automated framework for generating preference data, along with a more fine-grained dataset of personalized preferences; (ii) analysis of the positioning of the existing mainstream LLMs across five major global regions within the personalized preference space; and (iii) a sample reweighting method for personalized preference alignment, enabling alignment with a target persona while maximizing the divergence from other personas. Empirical experiments show our method outperforms the baselines.

摘要

人类偏好在大型语言模型（LLM）的优化过程中起着关键作用。然而，收集人类偏好反馈成本高昂，且现有数据集大多忽视了个性化与偏好之间的关联性。为解决这一问题，我们提出了Fair-PP——一个针对社会公平的合成个性化偏好数据集，该数据集源自真实世界的社会调查数据，涵盖28个社会群体、98个公平议题和5个个人偏好维度。基于GPT-4o-mini，我们根据现有社会调查数据指导的七种代表性人物画像进行角色扮演，最终生成238,623条偏好记录。通过Fair-PP，我们还贡献了：（i）一个自动化偏好数据生成框架，以及更细粒度的个性化偏好数据集；（ii）对现有主流LLM在五大全球区域个性化偏好空间中的定位分析；（iii）一种面向个性化偏好对齐的样本重加权方法，可实现与目标人物对齐的同时最大化与其他人物画像的差异性。实证实验表明，我们的方法优于基线模型。

VeriReason: Reinforcement Learning with Testbench Feedback for Reasoning-Enhanced Verilog Generation

Abstract

arXiv:2505.11849v1 Announce Type: new Abstract: Automating Register Transfer Level (RTL) code generation using Large Language Models (LLMs) offers substantial promise for streamlining digital circuit design and reducing human effort. However, current LLM-based approaches face significant challenges with training data scarcity, poor specification-code alignment, lack of verification mechanisms, and balancing generalization with specialization. Inspired by DeepSeek-R1, we introduce VeriReason, a framework integrating supervised fine-tuning with Guided Reward Proximal Optimization (GRPO) reinforcement learning for RTL generation. Using curated training examples and a feedback-driven reward model, VeriReason combines testbench evaluations with structural heuristics while embedding self-checking capabilities for autonomous error correction. On the VerilogEval Benchmark, VeriReason delivers significant improvements, achieving 83.1% functional correctness on the VerilogEval Machine benchmark and substantially outperforming both comparable-sized models and much larger commercial systems like GPT-4 Turbo. Additionally, our approach demonstrates up to a 2.8× increase in first-attempt functional correctness compared to baseline methods and exhibits robust generalization to unseen designs. To our knowledge, VeriReason represents the first system to successfully integrate explicit reasoning capabilities with reinforcement learning for Verilog generation, establishing a new state-of-the-art for automated RTL synthesis. The models and datasets are available at: https://huggingface.co/collections/AI4EDA-CASE. Code is available at: https://github.com/NellyW8/VeriReason

摘要

利用大语言模型（LLMs）自动化寄存器传输级（RTL）代码生成为简化数字电路设计、降低人力成本提供了巨大潜力。然而，当前基于LLM的方法面临训练数据稀缺、规范与代码对齐不佳、缺乏验证机制以及泛化与专业化平衡等重大挑战。受DeepSeek-R1启发，我们提出VeriReason框架，该框架将监督微调与引导奖励近端优化（GRPO）强化学习相结合，用于RTL生成。通过精选训练样本和反馈驱动的奖励模型，VeriReason将测试平台评估与结构启发式方法相结合，同时嵌入自检能力以实现自主纠错。在VerilogEval基准测试中，VeriReason表现出显著提升：在VerilogEval Machine基准上实现83.1%的功能正确率，大幅优于同规模模型及GPT-4 Turbo等大型商业系统。此外，相较于基线方法，我们的方案首次尝试功能正确率最高提升2.8倍，并对未见设计展现出强大泛化能力。据我们所知，VeriReason是首个成功将显式推理能力与强化学习结合用于Verilog生成的系统，为自动化RTL合成确立了新标杆。

MLLM-based Discovery of Intrinsic Coordinates and Governing Equations from High-Dimensional Data

Abstract

arXiv:2505.11940v1 Announce Type: new Abstract: Discovering governing equations from scientific data is crucial for understanding the evolution of systems, and is typically framed as a search problem within a candidate equation space. However, the high-dimensional nature of dynamical systems leads to an exponentially expanding equation space, making the search process extremely challenging. The visual perception and pre-trained scientific knowledge of multimodal large language models (MLLM) hold promise for providing effective navigation in high-dimensional equation spaces. In this paper, we propose a zero-shot method based on MLLM for automatically discovering physical coordinates and governing equations from high-dimensional data. Specifically, we design a series of enhanced visual prompts for MLLM to enhance its spatial perception. In addition, MLLM's domain knowledge is employed to navigate the search process within the equation space. Quantitative and qualitative evaluations on two representative types of systems demonstrate that the proposed method effectively discovers the physical coordinates and equations from both simulated and real experimental data, with long-term extrapolation accuracy improved by approximately 26.96% compared to the baseline.

摘要

从科学数据中发现支配方程对于理解系统演化至关重要，通常被构建为候选方程空间中的搜索问题。然而，动力系统的高维特性导致方程空间呈指数级扩张，使得搜索过程极具挑战性。多模态大语言模型（MLLM）的视觉感知与预训练科学知识有望为高维方程空间提供有效导航。本文提出一种基于MLLM的零样本方法，用于从高维数据中自动发现物理坐标与支配方程。具体而言，我们设计了一系列增强型视觉提示以提升MLLM的空间感知能力，并利用其领域知识引导方程空间内的搜索过程。通过对两类典型系统的定量与定性评估表明，该方法能有效从仿真和真实实验数据中发现物理坐标与方程，其长期外推精度较基线方法提升约26.96%。

LLM-Enhanced Feature Engineering for Multi-Factor Electricity Price Predictions

Abstract

arXiv:2505.11890v1 Announce Type: new Abstract: Accurately forecasting electricity price volatility is crucial for effective risk management and decision-making. Traditional forecasting models often fall short in capturing the complex, non-linear dynamics of electricity markets, particularly when external factors like weather conditions and market volatility are involved. These limitations hinder their ability to provide reliable predictions in markets with high volatility, such as the New South Wales (NSW) electricity market. To address these challenges, we introduce FAEP, a Feature-Augmented Electricity Price Prediction framework. FAEP leverages Large Language Models (LLMs) combined with advanced feature engineering to enhance prediction accuracy. By incorporating external features such as weather data and price volatility jumps, and utilizing Retrieval-Augmented Generation (RAG) for effective feature extraction, FAEP overcomes the shortcomings of traditional approaches. A hybrid XGBoost-LSTM model in FAEP further refines these augmented features, resulting in a more robust prediction framework. Experimental results demonstrate that FAEP achieves state-of-the-art (SOTA) performance compared to other electricity price prediction models in the Australian New South Wales electricity market, showcasing the efficiency of LLM-enhanced feature engineering and hybrid machine learning architectures.

摘要

准确预测电价波动对于有效的风险管理和决策制定至关重要。传统预测模型往往难以捕捉电力市场中复杂的非线性动态特性，特别是在涉及天气条件和市场波动等外部因素时。这些局限性导致其无法在澳大利亚新南威尔士州（NSW）等高波动性电力市场提供可靠预测。为解决这些问题，我们提出了FAEP框架——一种基于特征增强的电价预测方法。该框架通过结合大型语言模型（LLMs）与先进特征工程技术来提升预测精度，具体包括整合天气数据和价格波动跳跃等外部特征，并采用检索增强生成（RAG）技术实现高效特征提取。FAEP中的XGBoost-LSTM混合模型进一步优化了这些增强特征，从而构建出更稳健的预测框架。实验结果表明，在澳大利亚新南威尔士电力市场中，FAEP相比其他电价预测模型实现了最先进（SOTA）的性能，充分证明了LLM增强的特征工程与混合机器学习架构的有效性。

Evaluating the Logical Reasoning Abilities of Large Reasoning Models

Abstract

arXiv:2505.11854v1 Announce Type: new Abstract: Large reasoning models, often post-trained on long chain-of-thought (long CoT) data with reinforcement learning, achieve state-of-the-art performance on mathematical, coding, and domain-specific reasoning benchmarks. However, their logical reasoning capabilities - fundamental to human cognition and independent of domain knowledge - remain understudied. To address this gap, we introduce LogiEval, a holistic benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis), sourced from high-quality human examinations (e.g., LSAT, GMAT). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance, yet exhibit uneven capabilities across reasoning types and formats, highlighting limitations in their generalization. Our analysis reveals that human performance does not mirror model failure distributions. To foster further research, we curate LogiEval-Hard, a challenging subset identified through a novel screening paradigm where small-model failures (Qwen3-30B-A3B) reliably predict difficulties for larger models. Modern models show striking, consistent failures on LogiEval-Hard. This demonstrates that fundamental reasoning bottlenecks persist across model scales, and establishes LogiEval-Hard as both a diagnostic tool and a rigorous testbed for advancing logical reasoning in LLMs.

摘要

大规模推理模型通常通过长链思维（long CoT）数据的强化学习后训练，在数学、编程和特定领域推理基准测试中达到最先进性能。然而，其逻辑推理能力——作为人类认知基础且独立于领域知识的核心特质——仍未得到充分研究。为填补这一空白，我们提出LogiEval，一个用于评估大规模推理模型逻辑推理能力的综合基准。LogiEval涵盖演绎、归纳、类比和溯因等多元推理类型，以及逻辑序列、论点分析等多种任务形式，数据源自LSAT、GMAT等高质量人类考试。实验表明，现代推理模型在四选一论点分析问题和类比推理上表现优异甚至超越人类，但在不同推理类型和任务形式间存在能力不均，凸显其泛化局限。分析揭示人类表现与模型失败分布并不一致。为推动研究，我们通过新型筛选范式构建LogiEval-Hard挑战子集：小模型（Qwen3-30B-A3B）的失败可稳定预测大模型面临的困难。现代模型在LogiEval-Hard上表现出显著且一致的失败模式，证实基础推理瓶颈在不同规模模型中持续存在，同时确立该子集作为诊断工具和推进大语言模型逻辑推理研究的严格测试平台。

LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners

Abstract

arXiv:2505.11942v1 Announce Type: new Abstract: Lifelong learning is essential for intelligent agents operating in dynamic environments. Current large language model (LLM)-based agents, however, remain stateless and unable to accumulate or transfer knowledge over time. Existing benchmarks treat agents as static systems and fail to evaluate lifelong learning capabilities. We present LifelongAgentBench, the first unified benchmark designed to systematically assess the lifelong learning ability of LLM agents. It provides skill-grounded, interdependent tasks across three interactive environments, Database, Operating System, and Knowledge Graph, with automatic label verification, reproducibility, and modular extensibility. Extensive experiments reveal that conventional experience replay has limited effectiveness for LLM agents due to irrelevant information and context length constraints. We further introduce a group self-consistency mechanism that significantly improves lifelong learning performance. We hope LifelongAgentBench will advance the development of adaptive, memory-capable LLM agents.

摘要

终身学习对于在动态环境中运行的智能体至关重要。然而当前基于大语言模型（LLM）的智能体仍处于无状态模式，无法随时间积累或迁移知识。现有基准测试将智能体视为静态系统，未能评估其终身学习能力。我们提出LifelongAgentBench——首个用于系统评估LLM智能体终身学习能力的统一基准，该基准在数据库、操作系统和知识图谱三个交互环境中提供技能导向的相互依存任务，具备自动标签验证、可复现性和模块化可扩展性。大量实验表明，由于无关信息和上下文长度限制，传统经验回放方法对LLM智能体效果有限。我们进一步提出群体自洽机制，可显著提升终身学习性能。期望LifelongAgentBench能推动具备记忆能力的自适应LLM智能体发展。

Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture

Abstract

arXiv:2505.11916v1 Announce Type: new Abstract: Existing large language model (LLM) serving systems typically employ a Prefill-Decode disaggregated architecture to prevent computational interference between the prefill and decode phases. However, real-world LLM serving scenarios often exhibit significant fluctuations in request input/output lengths. A traditional static prefill/decode node configuration ratio therefore leaves the computational load imbalanced between the two node types, which prevents efficient use of computing resources and limits the system's goodput. To address this challenge, we design and implement Arrow, an adaptive scheduler that leverages stateless instances and elastic instance pools to achieve efficient adaptive request and instance scheduling. Arrow dynamically adjusts the number of instances handling prefill and decode tasks based on real-time cluster performance metrics, significantly enhancing the system's capability to handle traffic spikes and load variations. Our evaluation under diverse real-world workloads shows that Arrow achieves up to $5.62 \times$ and $7.78 \times$ higher request serving rates compared to state-of-the-art PD-colocated and PD-disaggregated serving systems respectively.

摘要

现有大型语言模型（LLM）服务系统通常采用预填充-解码分离架构以避免两阶段间的计算干扰。然而实际应用中，请求输入/输出长度常呈现显著波动，导致传统静态节点配比引发计算负载失衡，从而阻碍计算资源的高效利用与系统吞吐提升。为此，我们设计并实现了自适应调度系统Arrow，通过无状态实例与弹性实例池实现高效的请求-实例动态调度。该系统基于实时集群性能指标动态调整预填充与解码任务实例数量，显著增强了系统应对流量峰值与负载波动的能力。多样化实际工作负载测试表明，相比当前最优的共置部署与分离部署系统，Arrow分别实现了5.62倍与7.78倍的请求处理速率提升。

LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation

Abstract

arXiv:2505.12031v1 Announce Type: new Abstract: Recent advancements in large language models (LLMs) have sparked considerable interest in automated theorem proving and a prominent line of research integrates stepwise LLM-based provers into tree search. In this paper, we introduce a novel proof-state exploration approach for training data synthesis, designed to produce diverse tactics across a wide range of intermediate proof states, thereby facilitating effective one-shot fine-tuning of LLM as the policy model. We also propose an adaptive beam size strategy, which effectively takes advantage of our data synthesis method and achieves a trade-off between exploration and exploitation during tree search. Evaluations on the MiniF2F and ProofNet benchmarks demonstrate that our method outperforms strong baselines under the stringent Pass@1 metric, attaining an average pass rate of $60.74\%$ on MiniF2F and $21.18\%$ on ProofNet. These results underscore the impact of large-scale synthetic data in advancing automated theorem proving.

摘要

大语言模型（LLMs）的最新进展引发了人们对自动定理证明的广泛兴趣，当前主流研究将基于LLM的逐步证明器集成到树搜索中。本文提出了一种新颖的证明状态探索方法用于训练数据合成，该方法旨在生成覆盖广泛中间证明状态的多样化策略，从而实现对LLM作为策略模型的有效单次微调。我们还提出了一种自适应束宽策略，该策略充分利用我们的数据合成方法，在树搜索过程中实现探索与利用的平衡。在MiniF2F和ProofNet基准测试上的评估表明，我们的方法在严格的Pass@1指标下优于强基线模型，在MiniF2F上达到60.74%的平均通过率，在ProofNet上达到21.18%。这些结果凸显了大规模合成数据对推进自动定理证明领域的重要作用。

Abstract

arXiv:2505.12006v1 Announce Type: new Abstract: This paper introduces SOCIA (Simulation Orchestration for Cyber-physical-social Intelligence and Agents), a novel end-to-end framework leveraging Large Language Model (LLM)-based multi-agent systems to automate the generation of high-fidelity Cyber-Physical-Social (CPS) simulators. Addressing the challenges of labor-intensive manual simulator development and complex data calibration, SOCIA integrates a centralized orchestration manager that coordinates specialized agents for tasks including data comprehension, code generation, simulation execution, and iterative evaluation-feedback loops. Through empirical evaluations across diverse CPS tasks, such as mask adoption behavior simulation (social), personal mobility generation (physical), and user modeling (cyber), SOCIA demonstrates its ability to produce high-fidelity, scalable simulations with reduced human intervention. These results highlight SOCIA's potential to offer a scalable solution for studying complex CPS phenomena.

摘要

本文介绍了一种新型端到端框架SOCIA（面向信息物理社会智能体与系统的仿真编排系统），该框架基于大语言模型（LLM）的多智能体系统，实现了高保真信息物理社会（CPS）模拟器的自动化生成。针对人工开发模拟器劳动密集和数据校准复杂等挑战，SOCIA通过集成中央编排管理器，协调数据理解、代码生成、仿真执行和迭代评估-反馈循环等专项任务的智能体。通过在口罩佩戴行为模拟（社会层面）、个人移动轨迹生成（物理层面）和用户建模（信息层面）等多样化CPS任务中的实证评估表明，SOCIA能够以较少人工干预生成高保真、可扩展的仿真系统。这些结果凸显了SOCIA为复杂CPS现象研究提供可扩展解决方案的潜力。

Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier

Abstract

arXiv:2505.11966v1 Announce Type: new Abstract: Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent step of verification, while intended to improve performance, further complicates this landscape by introducing its own challenging trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test-time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying solution completion points to trigger targeted verification and provide focused solver feedback. Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench. Furthermore, on challenging mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full approach outperforms baselines like self-consistency in reasoning accuracy and inference efficiency. Our system offers a scalable and effective solution to enhance LLM reasoning at test time.

摘要

针对复杂任务的大语言模型（LLM）推理本质上需要在解决方案准确性与计算效率之间进行权衡。后续的验证步骤虽旨在提升性能，却因引入自身挑战性权衡而进一步复杂化：若在测试时简单整合生成式奖励模型（GenRM）与LLM，其高复杂度可能导致计算资源难以承受；而更简单快速的方法则可能缺乏可靠性。为克服这些挑战，我们提出FlexiVe——一种通过"验证预算弹性分配"策略灵活平衡快速可靠直觉思维与缜密审慎思维的新型生成式验证器。我们进一步设计"求解-检测-验证"流水线，该高效推理时扩展框架智能集成FlexiVe，主动识别求解完成节点以触发定向验证并提供针对性求解反馈。实验表明FlexiVe在ProcessBench基准上能精准定位推理轨迹中的错误。此外，在具有挑战性的数学推理基准（AIME 2024、AIME 2025和CNMO）上，我们的完整方案在推理准确性和推理效率方面均优于自洽性等基线方法。本系统为增强测试时LLM推理能力提供了可扩展的有效解决方案。

Interactional Fairness in LLM Multi-Agent Systems: An Evaluation Framework

Abstract

arXiv:2505.12001v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly used in multi-agent systems, questions of fairness should extend beyond resource distribution and procedural design to include the fairness of how agents communicate. Drawing from organizational psychology, we introduce a novel framework for evaluating Interactional fairness encompassing Interpersonal fairness (IF) and Informational fairness (InfF) in LLM-based multi-agent systems (LLM-MAS). We extend the theoretical grounding of Interactional Fairness to non-sentient agents, reframing fairness as a socially interpretable signal rather than a subjective experience. We then adapt established tools from organizational justice research, including Colquitt's Organizational Justice Scale and the Critical Incident Technique, to measure fairness as a behavioral property of agent interaction. We validate our framework through a pilot study using controlled simulations of a resource negotiation task. We systematically manipulate tone, explanation quality, outcome inequality, and task framing (collaborative vs. competitive) to assess how IF influences agent behavior. Results show that tone and justification quality significantly affect acceptance decisions even when objective outcomes are held constant. In addition, the influence of IF vs. InfF varies with context. This work lays the foundation for fairness auditing and norm-sensitive alignment in LLM-MAS.

摘要

随着大型语言模型（LLMs）在多智能体系统中的日益广泛应用，公平性问题应从资源分配和程序设计延伸至智能体间交互的公平性评估。借鉴组织心理学理论，我们提出一个新颖的框架用于评估基于LLM的多智能体系统（LLM-MAS）中的交互公平性，该框架包含人际公平（IF）和信息公平（InfF）两个维度。我们将交互公平性的理论基础扩展至非感知智能体，将其重新定义为社会可解读的信号而非主观体验。随后，我们采用组织公正研究中的成熟工具——包括Colquitt组织公正量表和关键事件技术——将公平性量化为智能体交互的行为属性。通过资源协商任务的受控模拟实验，我们对该框架进行了初步验证：系统操纵语气、解释质量、结果不平等性及任务框架（协作型vs.竞争型）以评估IF对智能体行为的影响。结果显示，即使在客观结果恒定的情况下，语气和理由质量仍显著影响接受决策。此外，IF与InfF的相对影响力随情境变化而不同。本研究为LLM-MAS的公平性审计和规范敏感对齐奠定了基础。

Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents

Abstract

arXiv:2505.12065v1 Announce Type: new Abstract: Large Language Model (LLM)-based search agents have shown remarkable capabilities in solving complex tasks by dynamically decomposing problems and addressing them through interleaved reasoning and retrieval. However, this interleaved paradigm introduces substantial efficiency bottlenecks. First, we observe that both highly accurate and overly approximate retrieval methods degrade system efficiency: exact search incurs significant retrieval overhead, while coarse retrieval requires additional reasoning steps during generation. Second, we identify inefficiencies in system design, including improper scheduling and frequent retrieval stalls, which lead to cascading latency -- where even minor delays in retrieval amplify end-to-end inference time. To address these challenges, we introduce SearchAgent-X, a high-efficiency inference framework for LLM-based search agents. SearchAgent-X leverages high-recall approximate retrieval and incorporates two key techniques: priority-aware scheduling and non-stall retrieval. Extensive experiments demonstrate that SearchAgent-X consistently outperforms state-of-the-art systems such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4 $\times$ higher throughput and 5 $\times$ lower latency, without compromising generation quality. SearchAgent-X is available at https://github.com/tiannuo-yang/SearchAgent-X.

摘要

基于大语言模型（LLM）的搜索代理通过动态分解问题并交织推理与检索来解决复杂任务，展现出卓越能力。然而这种交织范式存在显著的效率瓶颈。首先，我们发现高精度检索与过度近似检索方法均会降低系统效率：精确搜索带来巨大检索开销，而粗略检索则需在生成过程中增加额外推理步骤。其次，系统设计存在低效问题，包括不当调度和频繁检索停滞，导致级联延迟——即使检索中的微小延迟也会放大端到端推理时间。针对这些挑战，我们提出SearchAgent-X，一个面向LLM搜索代理的高效推理框架。该框架采用高召回率近似检索，并整合两项关键技术：优先级感知调度和无停滞检索。大量实验表明，SearchAgent-X在多样化任务中持续优于vLLM和基于HNSW检索等先进系统，最高可实现3.4倍吞吐量提升和5倍延迟降低，且不损害生成质量。SearchAgent-X已在https://github.com/tiannuo-yang/SearchAgent-X开源。

Efficient RL Training for Reasoning Models via Length-Aware Optimization

Abstract

arXiv:2505.12284v1 Announce Type: new Abstract: Large reasoning models, such as OpenAI o1 or DeepSeek R1, have demonstrated remarkable performance on reasoning tasks but often incur a long reasoning path with significant memory and time costs. Existing methods primarily aim to shorten reasoning paths by introducing additional training data and stages. In this paper, we propose three critical reward designs integrated directly into the reinforcement learning process of large reasoning models, which reduce the response length without extra training stages. Experiments on four settings show that our method significantly decreases response length while maintaining or even improving performance. Specifically, in a logic reasoning setting, we achieve a 40% reduction in response length averaged by steps alongside a 14% gain in performance. For math problems, we reduce response length averaged by steps by 33% while preserving performance.

摘要

大型推理模型（如OpenAI o1或DeepSeek R1）在推理任务中展现出卓越性能，但通常伴随冗长的推理路径，导致显著的内存与时间开销。现有方法主要通过引入额外训练数据和阶段来缩短推理路径。本文提出三种关键奖励设计，将其直接集成至大型推理模型的强化学习过程中，从而无需额外训练阶段即可缩减响应长度。在四种实验场景中，我们的方法在保持甚至提升性能的同时显著降低了响应长度。具体而言，在逻辑推理场景中，我们实现了步骤平均响应长度减少40%，同时性能提升14%；对于数学问题，在保持性能不变的情况下，步骤平均响应长度减少33%。

Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation

Abstract

arXiv:2505.12058v1 Announce Type: new Abstract: Tiny QA Benchmark++ (TQB++) presents an ultra-lightweight, multilingual smoke-test suite designed to give large-language-model (LLM) pipelines a unit-test style safety net dataset that runs in seconds with minimal cost. It was born out of the tight feedback-loop demands of building the Comet Opik prompt-optimization SDK, where waiting on heavyweight benchmarks breaks developer flow. TQB++ couples a 52-item English gold set (less than 20 kB) with a tiny synthetic-data generator PyPI package built on provider-agnostic LiteLLM. The generator lets practitioners mint their own tiny packs in any language, domain, or difficulty, while ten ready-made packs already cover Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish. Every dataset ships with Croissant metadata and plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools, so teams can drop deterministic micro-benchmarks directly into pull-request gates, prompt-engineering loops, and production dashboards without touching GPU budgets. A complete TQB++ run adds only a few seconds to pipeline latency yet reliably flags prompt-template errors, tokenizer drift, and fine-tuning side-effects long before full-scale suites like MMLU or BIG-Bench would finish configuring. The entire framework is released to accelerate continuous, resource-efficient quality assurance across the generative-AI ecosystem.

摘要

Tiny QA Benchmark++（TQB++）提出了一种超轻量级、多语言的冒烟测试套件，旨在为大型语言模型（LLM）流程提供一个单元测试风格的安全网数据集，该数据集可在数秒内以极低成本运行。该工具源于构建Comet Opik提示优化SDK时对紧密反馈循环的需求，因为在开发过程中等待重量级基准测试会中断开发流程。TQB++将包含52个项目的英语黄金数据集（小于20 kB）与一个基于与提供商无关的LiteLLM构建的微型合成数据生成器PyPI包相结合。该生成器允许从业者以任何语言、领域或难度创建自己的微型数据集包，同时已提供的十个现成包覆盖了阿拉伯语、中文、法语、德语、日语、韩语、葡萄牙语、俄语、西班牙语和土耳其语。每个数据集均附带Croissant元数据以及即插即用文件，支持OpenAI-Evals、LangChain和标准CI工具，使团队能够将确定性微基准测试直接集成到拉取请求门控、提示工程循环和生产仪表板中，而无需触及GPU预算。完整的TQB++运行仅增加几秒的流程延迟，却能可靠地标记出提示模板错误、分词器漂移和微调副作用，其速度远超MMLU或BIG-Bench等全规模测试套件的配置时间。整个框架的发布旨在加速生成式AI生态系统中持续且资源高效的质量保障。
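To make the "unit-test style safety net" idea concrete, here is a minimal sketch of an exact-match smoke test over a tiny QA pack. It does not use the TQB++ package or its data: the three items in `TINY_PACK`, the `stub_model` function, the normalization rule, and the 0.9 threshold are all hypothetical stand-ins chosen for illustration.

```python
def normalize(text):
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())

def exact_match_accuracy(items, answer_fn):
    """Fraction of (question, answer) pairs the model answers exactly."""
    hits = sum(normalize(answer_fn(q)) == normalize(a) for q, a in items)
    return hits / len(items)

def smoke_test(items, answer_fn, threshold=0.9):
    """Pass/fail gate suitable for a CI step (threshold is an assumption)."""
    return exact_match_accuracy(items, answer_fn) >= threshold

# Hypothetical three-item pack; a real pack would be loaded from a dataset file.
TINY_PACK = [
    ("What is the capital of France?", "Paris"),
    ("How many days are in a week?", "7"),
    ("What is 2 + 2?", "4"),
]

def stub_model(question):
    """Stand-in for an LLM call; returns canned answers."""
    canned = {
        "What is the capital of France?": "Paris.",
        "How many days are in a week?": "7",
        "What is 2 + 2?": "4",
    }
    return canned.get(question, "")
```

In a pipeline, `answer_fn` would wrap the real model call, and a failing `smoke_test` would block the pull request before any heavyweight benchmark is launched.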

CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction

Abstract

arXiv:2505.12057v1 Announce Type: new Abstract: AI-driven models have shown great promise in detecting errors in radiology reports, yet the field lacks a unified benchmark for rigorous evaluation of error detection and further correction. To address this gap, we introduce CorBenchX, a comprehensive suite for automated error detection and correction in chest X-ray reports, designed to advance AI-assisted quality control in clinical practice. We first synthesize a large-scale dataset of 26,326 chest X-ray error reports by injecting clinically common errors via prompting DeepSeek-R1, with each corrupted report paired with its original text, error type, and human-readable description. Leveraging this dataset, we benchmark both open- and closed-source vision-language models (e.g., InternVL, Qwen-VL, GPT-4o, o4-mini, and Claude-3.7) for error detection and correction under zero-shot prompting. Among these models, o4-mini achieves the best performance, with 50.6% detection accuracy and correction scores of BLEU 0.853, ROUGE 0.924, BERTScore 0.981, SembScore 0.865, and CheXbertF1 0.954. These results remain below clinical-level accuracy, highlighting the challenge of precise report correction. To advance the state of the art, we propose a multi-step reinforcement learning (MSRL) framework that optimizes a multi-objective reward combining format compliance, error-type accuracy, and BLEU similarity. We apply MSRL to QwenVL2.5-7B, the top open-source model in our benchmark, achieving an improvement of 38.3% in single-error detection precision and 5.2% in single-error correction over the zero-shot baseline.

摘要

人工智能驱动模型在放射学报告错误检测方面展现出巨大潜力，但该领域目前缺乏统一的基准来严格评估错误检测及后续修正能力。为填补这一空白，我们推出CorBenchX——一个用于胸片报告自动错误检测与修正的综合测试平台，旨在推进临床实践中AI辅助的质量控制。我们首先通过提示DeepSeek-R1注入临床常见错误，合成了包含26,326份胸片错误报告的大规模数据集，每份错误报告均配有原始文本、错误类型及人工可读描述。基于该数据集，我们对开源和闭源视觉语言模型（如InternVL、Qwen-VL、GPT-4o、o4-mini和Claude-3.7）进行零样本提示下的错误检测与修正基准测试。其中o4-mini表现最佳，检测准确率达50.6%，修正评分为BLEU 0.853、ROUGE 0.924、BERTScore 0.981、SembScore 0.865和CheXbertF1 0.954，但仍未达到临床级精度，凸显了精确报告修正的挑战性。为推进技术发展，我们提出多步强化学习（MSRL）框架，通过优化格式合规性、错误类型准确率和BLEU相似度的多目标奖励函数。将该框架应用于基准测试中表现最佳的开源模型QwenVL2.5-7B后，单错误检测精度提升38.3%，单错误修正率较零样本基线提高5.2%。

BeliefNest: A Joint Action Simulator for Embodied Agents with Theory of Mind

Abstract

arXiv:2505.12321v1 Announce Type: new Abstract: This paper introduces an open-source simulator, BeliefNest, designed to enable embodied agents to perform collaborative tasks by leveraging Theory of Mind. BeliefNest dynamically and hierarchically constructs simulators within a Minecraft environment, allowing agents to explicitly represent nested belief states about themselves and others. This enables agent control in open-domain tasks that require Theory of Mind reasoning. The simulator provides a prompt generation mechanism based on each belief state, facilitating the design and evaluation of methods for agent control utilizing large language models (LLMs). We demonstrate through experiments that agents can infer others' beliefs and predict their belief-based actions in false-belief tasks.

摘要

本文介绍了一款开源模拟器BeliefNest，旨在通过心智理论实现具身智能体执行协作任务。该模拟器在Minecraft环境中动态分层构建仿真框架，使智能体能够显式表征自我与他人的嵌套信念状态，从而支持需要心智理论推理的开放域任务中的智能体控制。该模拟器提供基于各信念状态的提示生成机制，便于利用大语言模型（LLMs）进行智能体控制方法的设计与评估。实验表明，智能体在错误信念任务中能够推断他人信念并预测其基于信念的行为。

LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

Abstract

arXiv:2505.12135v1 Announce Type: new Abstract: Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce $\textbf{LLM-BabyBench}$ , a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state ( $\textbf{Predict}$ task), (2) generating sequences of low-level actions to achieve specified objectives ( $\textbf{Plan}$ task), and (3) decomposing high-level instructions into coherent subgoal sequences ( $\textbf{Decompose}$ task). We detail the methodology for generating the three corresponding datasets ( $\texttt{LLM-BabyBench-Predict}$ , $\texttt{-Plan}$ , $\texttt{-Decompose}$ ) by extracting structured information from an expert agent operating within the text-based environment. Furthermore, we provide a standardized evaluation harness and metrics, including environment interaction for validating generated plans, to facilitate reproducible assessment of diverse LLMs. Initial baseline results highlight the challenges posed by these grounded reasoning tasks. The benchmark suite, datasets, data generation code, and evaluation code are made publicly available ( $\href{https://github.com/choukrani/llm-babybench}{\text{GitHub}}$ , $\href{https://huggingface.co/datasets/salem-mbzuai/LLM-BabyBench}{\text{HuggingFace}}$ ).

摘要

评估大型语言模型（LLM）在交互环境约束下进行规划和推理的能力，对于开发强大的人工智能代理至关重要。为此，我们推出专为这一目标设计的全新基准测试套件 $\textbf{LLM-BabyBench}$ 。该套件基于文本化改编的程序化生成BabyAI网格世界构建，从具身智能的三个基础维度评估LLM：(1) 预测行为对环境状态的影响（ $\textbf{Predict}$ 任务），(2) 生成实现特定目标的底层动作序列（ $\textbf{Plan}$ 任务），(3) 将高层指令分解为连贯的子目标序列（ $\textbf{Decompose}$ 任务）。我们详细阐述了通过从文本环境中运行的专家代理提取结构化信息，生成三个对应数据集（ $\texttt{LLM-BabyBench-Predict}$ 、 $\texttt{-Plan}$ 、 $\texttt{-Decompose}$ ）的方法论，并提供了标准化评估框架与指标（包括用于验证生成计划的环境交互机制），以促进不同LLM的可复现评估。初始基线结果凸显了这些具身推理任务带来的挑战。该基准套件、数据集、数据生成代码及评估代码均已开源（ $\href{https://github.com/choukrani/llm-babybench}{\text{GitHub}}$ 、 $\href{https://huggingface.co/datasets/salem-mbzuai/LLM-BabyBench}{\text{HuggingFace}}$ ）。

ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates

Abstract

arXiv:2505.12242v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) often exceeds GPU memory limits, prompting systems to offload model states to CPU memory. However, existing offloaded training frameworks like ZeRO-Offload treat all parameters equally and update the full model on the CPU, causing severe GPU stalls, where fast, expensive GPUs sit idle waiting for slow CPU updates and limited-bandwidth PCIe transfers. We present ZenFlow, a new offloading framework that prioritizes important parameters and decouples updates between GPU and CPU. ZenFlow performs in-place updates of important gradients on GPU, while asynchronously offloading and accumulating less important ones on CPU, fully overlapping CPU work with GPU computation. To scale across GPUs, ZenFlow introduces a lightweight gradient selection method that exploits a novel spatial and temporal locality property of important gradients, avoiding costly global synchronization. ZenFlow achieves up to 5x end-to-end speedup, 2x lower PCIe traffic, and reduces GPU stalls by over 85 percent, all while preserving accuracy.

摘要

微调大型语言模型（LLM）常超出GPU内存限制，促使系统将模型状态卸载至CPU内存。然而现有卸载训练框架（如ZeRO-Offload）均等对待所有参数并在CPU上更新完整模型，导致严重的GPU停滞——高速昂贵的GPU因等待低速CPU更新和有限带宽的PCIe传输而闲置。我们提出ZenFlow框架，通过优先级划分实现参数差异化处理，并解耦GPU与CPU的更新过程。该框架在GPU上原位更新重要梯度，同时将次要梯度异步卸载至CPU进行累积，实现CPU工作与GPU计算的完全重叠。为支持多GPU扩展，ZenFlow引入轻量级梯度选择方法，利用重要梯度特有的时空局部性特性，避免昂贵的全局同步。实验表明，ZenFlow在保持精度的前提下，可实现最高5倍的端到端加速，降低50%的PCIe传输流量，并将GPU停滞减少85%以上。
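The split between immediately applied "important" gradients and deferred, accumulated ones can be illustrated with a toy optimizer. This is only a sketch under stated assumptions, not ZenFlow's implementation: importance is approximated here by per-entry gradient magnitude (the abstract instead describes a selection method based on spatial and temporal locality), the update rule is plain SGD, and `DeferredSGD`, `k`, and `flush_every` are hypothetical names. Everything runs on the CPU with NumPy; the "GPU path" and "CPU path" are marked only in comments.

```python
import numpy as np

def split_by_importance(grad, k):
    """Boolean mask marking the k largest-magnitude gradient entries."""
    mask = np.zeros(grad.shape, dtype=bool)
    top = np.argsort(np.abs(grad))[-k:]
    mask[top] = True
    return mask

class DeferredSGD:
    """Toy SGD: top-k entries update every step, the rest are accumulated."""

    def __init__(self, params, lr=0.1, k=2, flush_every=4):
        self.params = params.astype(float).copy()
        self.lr, self.k, self.flush_every = lr, k, flush_every
        # Stands in for the CPU-side accumulator of less important gradients.
        self.buffer = np.zeros_like(self.params)
        self.step_count = 0

    def step(self, grad):
        mask = split_by_importance(grad, self.k)
        # Important entries: applied immediately (the in-place GPU path).
        self.params[mask] -= self.lr * grad[mask]
        # Remaining entries: only accumulated (the offloaded path).
        self.buffer[~mask] += grad[~mask]
        self.step_count += 1
        if self.step_count % self.flush_every == 0:
            # Periodic flush applies the accumulated gradients in one go.
            self.params -= self.lr * self.buffer
            self.buffer[:] = 0.0
```

With a constant gradient, the parameters after a flush match what plain SGD would have produced; between flushes only the selected entries move, which is the part a real system could overlap with ongoing computation.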

Beyond Single-Point Judgment: Distribution Alignment for LLM-as-a-Judge

Abstract

arXiv:2505.12301v1 Announce Type: new Abstract: LLMs have emerged as powerful evaluators in the LLM-as-a-Judge paradigm, offering significant efficiency and flexibility compared to human judgments. However, previous methods primarily rely on single-point evaluations, overlooking the inherent diversity and uncertainty in human evaluations. This approach leads to information loss and decreases the reliability of evaluations. To address this limitation, we propose a novel training framework that explicitly aligns the LLM-generated judgment distribution with empirical human distributions. Specifically, we propose a distributional alignment objective based on KL divergence, combined with an auxiliary cross-entropy regularization to stabilize the training process. Furthermore, considering that empirical distributions may derive from limited human annotations, we incorporate adversarial training to enhance model robustness against distribution perturbations. Extensive experiments across various LLM backbones and evaluation tasks demonstrate that our framework significantly outperforms existing closed-source LLMs and conventional single-point alignment methods, with improved alignment quality, evaluation accuracy, and robustness.

摘要

在"LLM即评委"范式下，大语言模型(LLM)已成为强大的评估工具，相较于人工评判展现出显著的效率与灵活性优势。然而既有方法主要依赖单点评估，忽视了人类评估固有的多样性与不确定性，导致信息丢失并降低评估可靠性。为解决这一局限，我们提出了一种新颖的训练框架，通过显式对齐LLM生成的判断分布与经验性人类分布来实现优化。具体而言，我们设计了基于KL散度的分布对齐目标函数，并结合辅助交叉熵正则化以稳定训练过程。进一步考虑到经验分布可能源自有限的人工标注数据，我们引入对抗训练以增强模型对分布扰动的鲁棒性。跨多种LLM主干模型和评估任务的大规模实验表明，本框架显著优于现有闭源LLM和传统单点对齐方法，在对齐质量、评估准确性和鲁棒性方面均有提升。
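As a rough illustration of the training objective described above, the sketch below combines a KL term between a human score distribution and the model's predicted distribution with an auxiliary cross-entropy term. The abstract does not specify the KL direction, the cross-entropy target, or the weighting, so the choices here (KL from human to model, the majority human label as target, and `ce_weight=0.1`) are assumptions; the adversarial-training component is not shown.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over score bins."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same score bins."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def alignment_loss(logits, human_dist, ce_weight=0.1):
    """KL alignment term plus a cross-entropy regularizer (assumed form)."""
    model_dist = softmax(logits)
    kl = kl_divergence(human_dist, model_dist)
    majority = int(np.argmax(human_dist))  # assumed CE target
    ce = -float(np.log(max(model_dist[majority], 1e-12)))
    return kl + ce_weight * ce
```

For example, with human ratings distributed as `[0.1, 0.2, 0.7]` over three score bins, logits that reproduce that distribution give a much smaller loss than logits concentrated on the first bin, even though a single-point target would only look at the majority label.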

Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering

Abstract

arXiv:2505.12189v1 Announce Type: new Abstract: Large language models (LLMs) frequently demonstrate reasoning limitations, often conflating content plausibility (i.e., material inference) with logical validity (i.e., formal inference). This can result in biased inferences, where plausible arguments are incorrectly deemed logically valid or vice versa. Mitigating this limitation is critical, as it undermines the trustworthiness and generalizability of LLMs in applications that demand rigorous logical consistency. This paper investigates the problem of mitigating content biases on formal reasoning through activation steering. Specifically, we curate a controlled syllogistic reasoning dataset to disentangle formal validity from content plausibility. After localising the layers responsible for formal and material inference, we investigate contrastive activation steering methods for test-time interventions. An extensive empirical analysis on different LLMs reveals that contrastive steering consistently supports linear control over content biases. However, we observe that a static approach is insufficient for improving all the tested models. We then leverage the possibility to control content effects by dynamically determining the value of the steering parameters via fine-grained conditional methods. We found that conditional steering is effective on unresponsive models, achieving up to 15% absolute improvement in formal reasoning accuracy with a newly introduced kNN-based method (K-CAST). Finally, additional experiments reveal that steering for content effects is robust to prompt variations, incurs minimal side effects on language modeling capabilities, and can partially generalize to out-of-distribution reasoning tasks. Practically, this paper demonstrates that activation-level interventions can offer a scalable strategy for enhancing the robustness of LLMs, contributing towards more systematic and unbiased formal reasoning.

摘要

大型语言模型（LLMs）经常表现出推理局限性，往往将内容合理性（即实质性推理）与逻辑有效性（即形式推理）混为一谈。这可能导致带有偏见的推断，即合理论证被错误地视为逻辑有效，反之亦然。减轻这一局限性至关重要，因为它会削弱LLMs在需要严格逻辑一致性的应用中的可信度和泛化能力。本文研究了通过激活导向来减轻形式推理中的内容偏见问题。具体而言，我们构建了一个受控的三段论推理数据集，以区分形式有效性与内容合理性。在定位负责形式推理和实质性推理的层级后，我们研究了用于测试时干预的对比激活导向方法。对不同LLMs的广泛实证分析表明，对比导向始终支持对内容偏见的线性控制。然而，我们观察到静态方法不足以改进所有测试模型。随后，我们利用通过细粒度条件方法动态确定导向参数值的可能性来控制内容效应。研究发现，条件导向对无响应模型有效，基于新引入的kNN方法（K-CAST）在形式推理准确率上实现了高达15%的绝对提升。最后，额外实验表明，针对内容效应的导向对提示变化具有鲁棒性，对语言建模能力的影响极小，并能部分泛化至分布外推理任务。从实践角度看，本文证明激活级干预可为增强LLMs的鲁棒性提供可扩展策略，有助于实现更系统且无偏的形式推理。
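
The static form of contrastive activation steering mentioned in the abstract can be illustrated with a minimal sketch. It assumes the steering direction is the difference between mean activations collected on two contrastive prompt sets, and that a fixed strength `alpha` scales the shift; the function names, the toy activations, and the choice of which prompt set counts as "formal" are hypothetical, not taken from the paper. The conditional methods in the abstract (including K-CAST) choose the steering parameters per input, which this sketch does not attempt.

```python
from statistics import fmean

def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrastive prompt sets."""
    dim = len(pos_acts[0])
    pos_mean = [fmean(a[i] for a in pos_acts) for i in range(dim)]
    neg_mean = [fmean(a[i] for a in neg_acts) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def steer(hidden, vector, alpha):
    """Shift one hidden state along the steering direction by a fixed strength."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy 2-dimensional activations (hypothetical values, assumed for illustration).
formal_acts = [[1.0, 0.0], [3.0, 2.0]]     # prompts answered by formal validity
material_acts = [[0.0, 1.0], [2.0, 3.0]]   # prompts answered by plausibility

v = steering_vector(formal_acts, material_acts)
steered = steer([0.5, 0.5], v, alpha=2.0)
```

In a real model the same addition would be applied to the residual stream at the layers the authors localise, at test time; the layer choice is outside the scope of this sketch.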

SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization

Abstract

arXiv:2505.12346v1 Announce Type: new Abstract: Large language models (LLMs) exhibit varying levels of confidence across input prompts (questions): some lead to consistent, semantically similar answers, while others yield diverse or contradictory outputs. This variation reflects the LLM's uncertainty about the input prompt, a signal of how confidently the model understands a given problem. However, vanilla Group Relative Policy Optimization (GRPO) treats all prompts equally during policy updates, ignoring this important information about the model's knowledge boundaries. To address this limitation, we propose SEED-GRPO (Semantic Entropy EnhanceD GRPO), which explicitly measures LLMs' uncertainty about input prompts via semantic entropy. Semantic entropy measures the diversity of meaning across multiple answers generated for a prompt, and SEED-GRPO uses it to modulate the magnitude of policy updates. This uncertainty-aware training mechanism enables dynamic adjustment of policy update magnitudes based on question uncertainty. It allows more conservative updates on high-uncertainty questions while maintaining the original learning signal on confident ones. Experimental results on five mathematical reasoning benchmarks (AIME24 56.7, AMC 68.7, MATH 83.4, Minerva 34.2, and OlympiadBench 48.0) demonstrate that SEED-GRPO achieves new state-of-the-art performance in average accuracy, validating the effectiveness of uncertainty-aware policy optimization.

摘要

大型语言模型（LLMs）对不同输入提示（问题）表现出不同程度的置信度：某些提示会生成语义一致的回答，而另一些则产生多样甚至矛盾的输出。这种差异反映了模型对输入提示的不确定性，是其理解问题置信度的重要信号。然而，传统群组相对策略优化（GRPO）在策略更新时平等对待所有提示，忽视了模型知识边界的关键信息。为解决这一局限，我们提出SEED-GRPO（语义熵增强型GRPO），通过语义熵显式量化LLMs对输入提示的不确定性。语义熵通过测量给定提示下多个生成答案的语义多样性，据此调节策略更新的幅度。这种不确定性感知训练机制能基于问题不确定性动态调整策略更新强度：对高不确定性问题进行保守更新，同时在置信问题上保持原始学习信号。在五个数学推理基准测试（AIME24 56.7、AMC 68.7、MATH 83.4、Minerva 34.2和OlympiadBench 48.0）上的实验结果表明，SEED-GRPO以平均准确率创造了新的最优性能，验证了不确定性感知策略优化的有效性。
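
A small sketch may help make the uncertainty-aware update concrete. It assumes each sampled answer has already been assigned to a meaning cluster (how clusters are obtained is not specified here), computes the entropy over clusters, and shrinks the group's advantages linearly as entropy approaches its maximum. The linear rule, the `strength` parameter, and all names are assumptions for illustration; the abstract does not state the exact scaling function.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Shannon entropy over the meaning clusters of the sampled answers."""
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def scale_advantages(advantages, cluster_ids, strength=1.0):
    """Shrink advantages on uncertain prompts (assumed linear rule)."""
    n = len(cluster_ids)
    max_entropy = math.log(n) if n > 1 else 1.0  # entropy if every answer differs
    weight = 1.0 - strength * semantic_entropy(cluster_ids) / max_entropy
    return [a * weight for a in advantages]

# Confident prompt: all four answers share one meaning, so updates are unchanged.
confident = scale_advantages([1.0, -1.0, 1.0, -1.0], ["a", "a", "a", "a"])

# Uncertain prompt: four distinct meanings, so updates shrink toward zero.
uncertain = scale_advantages([1.0, -1.0, 1.0, -1.0], ["a", "b", "c", "d"])
```

With `strength=1.0` the fully uncertain prompt contributes almost nothing; a smaller value would keep part of its learning signal.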

Enhancing User-Oriented Proactivity in Open-Domain Dialogues with Critic Guidance

Abstract

arXiv:2505.12334v1 Announce Type: new Abstract: Open-domain dialogue systems aim to generate natural and engaging conversations, providing significant practical value in real applications such as social robotics and personal assistants. The advent of large language models (LLMs) has greatly advanced this field by improving context understanding and conversational fluency. However, existing LLM-based dialogue systems often fall short in proactively understanding the user's chatting preferences and guiding conversations toward user-centered topics. This lack of user-oriented proactivity can lead users to feel unappreciated, reducing their satisfaction and willingness to continue the conversation in human-computer interactions. To address this issue, we propose a User-oriented Proactive Chatbot (UPC) to enhance user-oriented proactivity. Specifically, inspired by the LLM-as-a-judge strategy, we first construct a critic to evaluate this proactivity. Given the scarcity of high-quality training data, we then employ the critic to guide dialogues between the chatbot and user agents, generating a corpus with enhanced user-oriented proactivity. To ensure the diversity of user backgrounds, we introduce ISCO-800, a diverse user background dataset for constructing user agents. Moreover, because communication difficulty varies among users, we propose an iterative curriculum learning method that trains the chatbot from easy-to-communicate users to more challenging ones, thereby gradually enhancing its performance. Experiments demonstrate that our proposed training method is applicable to different LLMs, improving user-oriented proactivity and attractiveness in open-domain dialogues.

摘要

开放域对话系统旨在生成自然且引人入胜的对话，在社交机器人和个人助理等实际应用中具有重要实用价值。大语言模型（LLMs）的出现通过提升上下文理解与会话流畅性，极大地推动了该领域发展。然而，现有基于LLM的对话系统往往难以主动理解用户的聊天偏好，并将对话引导至以用户为中心的主题。这种缺乏用户导向主动性的缺陷易使用户感到未被重视，从而降低人机交互中的满意度和持续对话意愿。为解决这一问题，我们提出一种用户导向主动聊天机器人（UPC）以增强用户导向的主动性。具体而言，我们首先受LLM-as-a-judge策略启发构建评估该主动性的评判器；鉴于高质量训练数据的稀缺性，随后利用该评判器指导聊天机器人与用户代理间的对话，生成具有增强型用户导向主动性的语料库。为确保用户背景多样性，我们引入ISCO-800这一多样化用户背景数据集用于构建用户代理。此外，考虑到用户间沟通难度存在差异，我们提出一种迭代课程学习方法，使聊天机器人从易沟通用户逐步训练至更具挑战性的用户，从而持续提升其性能。实验表明，所提出的训练方法适用于不同LLM，能有效提升开放域对话中用户导向的主动性与吸引力。

Reasoning-CV: Fine-tuning Powerful Reasoning LLMs for Knowledge-Assisted Claim Verification

Abstract

arXiv:2505.12348v1 Announce Type: new Abstract: Claim verification is essential in combating misinformation, and large language models (LLMs) have recently emerged in this area as powerful tools for assessing the veracity of claims using external knowledge. Existing LLM-based methods for claim verification typically adopt a Decompose-Then-Verify paradigm, which involves decomposing complex claims into several independent sub-claims and verifying each sub-claim separately. However, this paradigm often introduces errors during the claim decomposition process. To mitigate these errors, we propose the Chain-of-Thought (CoT)-Verify paradigm, which leverages LLM reasoning methods to generate CoT-verification paths for the original complex claim without requiring decomposition into sub-claims and separate verification stages. The CoT-Verify paradigm allows us to propose a natural fine-tuning method called Reasoning-CV to enhance the verification capabilities of LLMs. Reasoning-CV includes a supervised fine-tuning (SFT) stage and a self-improvement direct preference optimization (DPO) stage. Utilizing only an 8B pre-trained LLM, Reasoning-CV demonstrates superior knowledge-assisted claim verification performance compared to existing Decompose-Then-Verify methods, as well as powerful black-box LLMs such as GPT-4o+CoT and o1-preview. Our code is available.

摘要

声明验证在打击错误信息方面至关重要，而大型语言模型（LLM）最近在这一领域崭露头角，成为利用外部知识评估声明真实性的强大工具。现有的基于LLM的声明验证方法通常采用“分解后验证”范式，即将复杂声明分解为若干独立的子声明并分别验证每个子声明。然而，这种范式在声明分解过程中常常引入错误。为减少这些错误，我们提出开发“思维链验证”（CoT-Verify）范式，该范式利用LLM推理方法为原始复杂声明生成思维链验证路径，而无需将其分解为子声明或分阶段验证。CoT-Verify范式使我们能够提出一种名为“推理验证微调”（Reasoning-CV）的自然微调方法，以增强LLM的验证能力。Reasoning-CV包括监督微调（SFT）阶段和自我改进的直接偏好优化（DPO）阶段。仅使用一个80亿参数的预训练LLM，Reasoning-CV在知识辅助声明验证方面展现出优于现有“分解后验证”方法以及GPT-4o+CoT和o1-preview等强大黑盒LLM的性能。我们的代码已公开。

Beyond Frameworks: Unpacking Collaboration Strategies in Multi-Agent Systems

Abstract

arXiv:2505.12467v1 Announce Type: new Abstract: Multi-agent collaboration has emerged as a pivotal paradigm for addressing complex, distributed tasks in large language model (LLM)-driven applications. While prior research has focused on high-level architectural frameworks, the granular mechanisms governing agents, critical to performance and scalability, remain underexplored. This study systematically investigates four dimensions of collaboration strategies: (1) agent governance, (2) participation control, (3) interaction dynamics, and (4) dialogue history management. Through rigorous experimentation under two context-dependent scenarios: Distributed Evidence Integration (DEI) and Structured Evidence Synthesis (SES), we quantify the impact of these strategies on both task accuracy and computational efficiency. Our findings reveal that centralized governance, instructor-led participation, ordered interaction patterns, and instructor-curated context summarization collectively optimize the trade-off between decision quality and resource utilization with the support of the proposed Token-Accuracy Ratio (TAR). This work establishes a foundation for designing adaptive, scalable multi-agent systems, shifting the focus from structural novelty to strategic interaction mechanics.

摘要

多智能体协作已成为解决大语言模型（LLM）驱动应用中复杂分布式任务的关键范式。尽管先前研究主要关注高层架构框架，但决定性能和可扩展性的核心智能体微观机制仍待深入探索。本研究系统考察了协作策略的四个维度：（1）智能体治理，（2）参与控制，（3）交互动态，以及（4）对话历史管理。通过在分布式证据整合（DEI）和结构化证据合成（SES）两种情境依赖场景下的严格实验，我们量化了这些策略对任务准确性和计算效率的影响。研究结果表明：在所提出的令牌-准确率比（TAR）支持下，集中式治理、导师主导的参与机制、有序交互模式以及导师筛选的上下文摘要能协同优化决策质量与资源利用的平衡。这项工作为设计自适应、可扩展的多智能体系统奠定了基础，将研究焦点从结构创新转向策略性交互机制。

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Abstract

arXiv:2505.12371v1 Announce Type: new Abstract: The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.

摘要

大型语言模型(LLMs)的快速发展激发了人们对多智能体协作解决复杂医疗任务的兴趣。然而，目前对多智能体协作方法实际优势的理解仍不充分。现有评估往往缺乏普适性，未能涵盖反映真实临床实践的多样化任务，且经常遗漏与基于单LLM方法及成熟传统方法的严格比较。为填补这一关键空白，我们提出了MedAgentBoard——一个用于系统评估多智能体协作、单LLM及传统方法的综合性基准平台。该平台涵盖四大类医疗任务：(1)医学(视觉)问答，(2)科普摘要生成，(3)结构化电子健康记录(EHR)预测建模，以及(4)跨文本、医学影像和结构化EHR数据的临床工作流自动化。大量实验揭示了差异化结果：虽然多智能体协作在特定场景(如提升临床工作流自动化的任务完整性)中显现优势，但其表现既不稳定优于先进单LLM(如在文本医学QA中)，也未能超越关键的专业化传统方法——后者在医学视觉问答和基于EHR的预测等任务中通常保持更优性能。MedAgentBoard提供了重要资源和可行见解，强调在医学AI方案选择和开发中必须采取基于具体任务、循证决策的方法，并指出必须审慎权衡多智能体协作固有复杂度与其实质性性能提升之间的关系。所有代码、数据集、详细提示及实验结果均已开源。

NeuroGen: Neural Network Parameter Generation via Large Language Models

Abstract

arXiv:2505.12470v1 Announce Type: new Abstract: Acquiring the parameters of neural networks (NNs) has been one of the most important problems in machine learning since the inception of NNs. Traditional approaches, such as backpropagation and forward-only optimization, acquire parameters via iterative data fitting to gradually optimize them. This paper aims to explore the feasibility of a new direction: acquiring NN parameters via large language model generation. We propose NeuroGen, a generalized and easy-to-implement two-stage approach for NN parameter generation conditioned on descriptions of the data, task, and network architecture. Stage one is Parameter Reference Knowledge Injection, where LLMs are pretrained on NN checkpoints to build foundational understanding of parameter space, whereas stage two is Context-Enhanced Instruction Tuning, enabling LLMs to adapt to specific tasks through enriched, task-aware prompts. Experimental results demonstrate that NeuroGen effectively generates usable NN parameters. Our findings highlight the feasibility of LLM-based NN parameter generation and suggest a promising new paradigm where LLMs and lightweight NNs can coexist synergistically.

摘要

自神经网络(NNs)诞生以来，获取其参数一直是机器学习领域最重要的课题之一。传统方法如反向传播和前向优化通过迭代数据拟合逐步优化参数。本文旨在探索一种新方向的可行性：基于大语言模型生成神经网络参数。我们提出NeuroGen——一种通用且易于实现的两阶段方法，可根据数据描述、任务需求和网络架构生成神经网络参数。第一阶段为参数参考知识注入，通过预训练语言模型于神经网络检查点以建立对参数空间的基础理解；第二阶段为上下文增强指令微调，通过富含任务信息的提示使语言模型适应特定任务。实验结果表明，NeuroGen能有效生成可用的神经网络参数。本研究证实了基于语言模型的神经网络参数生成的可行性，并提出了一种语言模型与轻量级神经网络协同共生的新范式。

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

Abstract

arXiv:2505.12575v1 Announce Type: new Abstract: Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions -- failing to capture the nature of mathematics encountered in actual research environments. We introduce RealMath, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks. Our approach addresses three critical challenges: sourcing diverse research-level content, enabling reliable automated evaluation through verifiable statements, and designing a continually refreshable dataset to mitigate contamination risks. Experimental results across multiple LLMs reveal surprising capabilities in handling research mathematics compared to competition problems, suggesting current models may already serve as valuable assistants for working mathematicians despite limitations on highly challenging problems. The code and dataset for RealMath are publicly available.

摘要

现有评估大语言模型数学推理能力的基准主要依赖于竞赛题目、形式化证明或人为设计的难题，未能反映实际研究环境中遇到的数学问题本质。我们提出RealMath基准，该基准直接源自研究论文和数学论坛，用于评估大语言模型在真实数学任务中的表现。我们的方法解决了三个关键挑战：获取多样化的研究级内容来源、通过可验证陈述实现可靠的自动化评估，以及设计可持续更新的数据集以降低污染风险。跨多个大语言模型的实验结果表明，与竞赛题目相比，当前模型在处理研究数学问题时展现出惊人能力，这表明尽管在应对高难度问题方面存在局限，现有模型可能已具备作为数学家工作助手的价值。RealMath的代码与数据集已公开。

MARGE: Improving Math Reasoning for LLMs with Guided Exploration

Abstract

arXiv:2505.12500v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit strong potential in mathematical reasoning, yet their effectiveness is often limited by a shortage of high-quality queries. This limitation necessitates scaling up computational responses through self-generated data, yet current methods struggle due to spuriously correlated data caused by ineffective exploration across all reasoning stages. To address this challenge, we introduce MARGE: Improving Math Reasoning with Guided Exploration, a novel method that enhances mathematical reasoning through hit-guided exploration. MARGE systematically explores intermediate reasoning states derived from self-generated solutions, enabling adequate exploration and improved credit assignment throughout the reasoning process. Through extensive experiments across multiple backbone models and benchmarks, we demonstrate that MARGE significantly improves reasoning capabilities without requiring external annotations or training additional value models. Notably, MARGE improves both single-shot accuracy and exploration diversity, mitigating a common trade-off in alignment methods. These results demonstrate MARGE's effectiveness in enhancing mathematical reasoning capabilities and unlocking the potential of scaling self-generated training data. Our code and models are available at https://github.com/georgao35/MARGE.

摘要

大语言模型(LLMs)在数学推理方面展现出强大潜力，但其有效性常受限于高质量查询的短缺。这一限制需要通过自生成数据来扩大计算响应规模，然而现有方法因无法有效探索所有推理阶段而导致虚假相关数据的问题。为解决这一挑战，我们提出MARGE：通过引导探索提升数学推理能力，这是一种利用命中引导探索来增强数学推理的新方法。MARGE系统性地探索源自自生成解决方案的中间推理状态，实现充分的探索过程并改善整个推理链中的信用分配。通过在多组骨干模型和基准测试上的广泛实验，我们证明MARGE能显著提升推理能力，且无需外部标注或训练额外价值模型。值得注意的是，MARGE同时提高了单次推理准确率和探索多样性，缓解了对齐方法中常见的权衡问题。这些结果表明MARGE在增强数学推理能力和释放自生成训练数据扩展潜力方面的有效性。我们的代码和模型可通过 https://github.com/georgao35/MARGE 获取。

ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning

Abstract

arXiv:2505.12501v1 Announce Type: new Abstract: Large language models (LLMs) excel at rapid generation of text and multimodal content, yet they falter on transaction-style planning that demands ACID-like guarantees and real-time disruption recovery. We present Adaptive LLM Agent System (ALAS), a framework that tackles four fundamental LLM deficits: (i) absence of self-verification, (ii) context erosion, (iii) next-token myopia, and (iv) lack of persistent state. ALAS decomposes each plan into role-specialized agents, equips them with automatic state tracking, and coordinates them through a lightweight protocol. When disruptions arise, agents apply history-aware local compensation, avoiding costly global replanning and containing cascade effects. On real-world, large-scale job-shop scheduling benchmarks, ALAS sets new best results for static sequential planning and excels in dynamic reactive scenarios with unexpected disruptions. These gains show that principled modularization plus targeted compensation can unlock scalable and resilient planning with LLMs.

摘要

大语言模型（LLMs）擅长快速生成文本和多模态内容，但在需要类ACID保证和实时中断恢复的事务型规划任务中表现欠佳。我们提出自适应LLM代理系统（ALAS），该框架解决了LLMs的四个根本性缺陷：（i）缺乏自我验证能力，（ii）上下文侵蚀，（iii）下一词元短视，以及（iv）持久状态缺失。ALAS将每个规划任务分解为角色专精的代理，配备自动状态追踪机制，并通过轻量级协议进行协调。当中断发生时，代理采用历史感知的局部补偿策略，避免代价高昂的全局重新规划并遏制级联效应。在现实世界的大规模作业车间调度基准测试中，ALAS在静态顺序规划方面创造了新纪录，并在存在意外中断的动态响应场景中表现卓越。这些成果表明，基于原则的模块化设计结合针对性补偿机制，能够实现LLMs可扩展且鲁棒的规划能力。

mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model

Abstract

arXiv:2505.12565v1 Announce Type: new Abstract: Despite their ability to understand chemical knowledge and accurately generate sequential representations, large language models (LLMs) remain limited in their capacity to propose novel molecules with drug-like properties. In addition, the molecules that LLMs propose can often be challenging to make in the lab. To more effectively enable the discovery of functional small molecules, LLMs need to learn a molecular language. However, LLMs are currently limited by encoding molecules from atoms. In this paper, we argue that just like tokenizing texts into (sub-)word tokens instead of characters, molecules should be decomposed and reassembled at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model that tokenizes molecules into building blocks and learns a bilingual language model of both natural language descriptions of functions and molecule building blocks. By reasoning on such functional building blocks, mCLM guarantees the generation of efficiently synthesizable molecules thanks to recent progress in block-based chemistry, while also improving the functions of molecules in a principled manner. In experiments on 430 FDA-approved drugs, we find mCLM capable of significantly improving 5 out of 6 chemical functions critical to determining drug potentials. More importantly, mCLM can reason on multiple functions and iteratively refine FDA-rejected drugs ("fallen angels") over multiple iterations to greatly reduce their shortcomings.

摘要

尽管大型语言模型（LLMs）能够理解化学知识并准确生成序列化表征，但其在提出具有类药特性的新型分子方面仍存在局限。此外，LLMs提出的分子往往难以在实验室中合成。为了更有效地促进功能性小分子的发现，LLMs需要学习分子语言。然而，当前LLMs仅局限于从原子层面编码分子。本文主张，正如将文本标记化为（子）词而非字符，分子也应在功能性构建块层面进行分解与重组——这些分子片段能带来独特功能，并作为现实世界自动化实验室合成的有效构建单元。基于此，我们提出模块化化学语言模型mCLM，该模型将分子标记化为构建块，并学习功能自然语言描述与分子构建块的双语语言模型。通过基于此类功能构建块进行推理，mCLM借助基于片段的化学合成最新进展，确保生成可高效合成的分子，同时以原理驱动的方式优化分子功能。在430种FDA批准药物的实验中，mCLM显著改善了决定药物潜力的6项关键化学功能中的5项。更重要的是，mCLM能进行多功能推理，并通过多次迭代改进FDA拒绝药物（"坠落天使"），大幅改善其缺陷。

Bullying the Machine: How Personas Increase LLM Vulnerability

Abstract

arXiv:2505.12692v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in interactions where they are prompted to adopt personas. This paper investigates whether such persona conditioning affects model safety under bullying, an adversarial manipulation that applies psychological pressure in order to force the victim to comply with the attacker. We introduce a simulation framework in which an attacker LLM engages a victim LLM using psychologically grounded bullying tactics, while the victim adopts personas aligned with the Big Five personality traits. Experiments using multiple open-source LLMs and a wide range of adversarial goals reveal that certain persona configurations -- such as weakened agreeableness or conscientiousness -- significantly increase the victim's susceptibility to unsafe outputs. Bullying tactics involving emotional or sarcastic manipulation, such as gaslighting and ridicule, are particularly effective. These findings suggest that persona-driven interaction introduces a novel vector for safety risks in LLMs and highlight the need for persona-aware safety evaluation and alignment strategies.

摘要

大型语言模型（LLMs）越来越多地被部署在需要模拟特定人格的交互场景中。本文研究了这种人设调节是否会影响模型在遭受霸凌时的安全性（霸凌是一种通过施加心理压力迫使受害者服从攻击者的对抗性操控手段）。我们提出了一个模拟框架，其中攻击者LLM采用基于心理学原理的霸凌策略与受害者LLM互动，而受害者则被赋予符合大五人格特质的人格设定。通过使用多个开源LLM和广泛对抗性目标进行的实验表明，某些人格配置（如降低的宜人性或责任心）会显著增加受害者产生不安全输出的可能性。涉及情感或讽刺操控的霸凌策略（如煤气灯效应和嘲弄）尤其有效。这些发现表明，基于人设的交互为LLMs引入了新的安全风险向量，并凸显了开发人设感知的安全评估与对齐策略的必要性。

Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps

Abstract

arXiv:2505.12731v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) has emerged as a pivotal method for expanding the knowledge of large language models. To handle complex queries more effectively, researchers developed Adaptive-RAG (A-RAG) to enhance generation quality through multiple interactions with external knowledge bases. Despite its effectiveness, A-RAG exacerbates the pre-existing efficiency challenges inherent in RAG, which are attributable to its reliance on multiple iterations of generation. Existing A-RAG approaches process all retrieved contents from scratch. However, they ignore the situation where there is a significant overlap in the content of the retrieval results across rounds. The overlapping content is redundantly represented, which leads to a large proportion of repeated computations, thus affecting the overall efficiency. To address this issue, this paper introduces a model-agnostic approach, generally applicable to A-RAG methods, that reduces the redundant representation process caused by overlapping retrieval results. Specifically, we use cache access and parallel generation to speed up the prefilling and decoding stages respectively. Additionally, we propose an instruction-driven module to further guide the model to attend more effectively to each part of the content in a way better suited to LLMs. Experiments show that our approach achieves average speedups of 2.79x for prefilling and 2.33x for decoding while maintaining equal generation quality.

摘要

检索增强生成（RAG）已成为扩展大语言模型知识的关键方法。为更有效处理复杂查询，研究者开发了自适应RAG（A-RAG），通过多次与外部知识库交互来提升生成质量。尽管效果显著，A-RAG加剧了RAG固有的效率挑战，这源于其对多轮生成迭代的依赖。现有A-RAG方法对所有检索内容从头开始处理，但忽视了多轮检索结果存在显著内容重叠的情况。重叠内容被冗余表征，导致大量重复计算，从而影响整体效率。针对该问题，本文提出一种与模型无关的通用方法，可广泛应用于A-RAG方法，旨在减少检索结果重叠导致的冗余表征过程。具体而言，我们采用缓存访问和并行生成分别加速预填充和解码阶段。此外，还提出指令驱动模块，进一步引导模型以更适合大语言模型的方式更有效地关注内容的各个部分。实验表明，在保持同等生成质量的前提下，我们的方法使预填充和解码阶段分别平均获得2.79倍和2.33倍的显著加速。
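
The idea of reusing representations for documents that reappear across retrieval rounds can be sketched as follows. This is a toy illustration under stated assumptions: `toy_encode` stands in for an expensive prefill pass, documents are keyed by a hypothetical `doc_id`, and a cached representation is assumed to be reusable regardless of where the document appears in the prompt. How the paper handles position and context dependence is not described in the abstract.

```python
def encode_round(docs, cache, encode):
    """Return representations for one retrieval round, reusing cached entries.

    docs maps doc_id -> text; cache persists across rounds.
    """
    reps = {}
    computed = []  # doc_ids that actually required a fresh encoding pass
    for doc_id, text in docs.items():
        if doc_id not in cache:
            cache[doc_id] = encode(text)
            computed.append(doc_id)
        reps[doc_id] = cache[doc_id]
    return reps, computed

def toy_encode(text):
    """Stand-in for an expensive prefill computation (assumed for illustration)."""
    return len(text.split())

cache = {}
_, first = encode_round({"d1": "alpha beta", "d2": "gamma"}, cache, toy_encode)
_, second = encode_round({"d2": "gamma", "d3": "delta epsilon zeta"}, cache, toy_encode)
# first == ["d1", "d2"]; second == ["d3"], because d2 overlaps with round one
```

Only the non-overlapping document is encoded in the second round, which is the saving the abstract attributes to cache access in the prefilling stage.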

Dense Communication between Language Models

Abstract

arXiv:2505.12741v1 Announce Type: new Abstract: As higher-level intelligence emerges from the combination of modular components with lower-level intelligence, many works combine Large Language Models (LLMs) for collective intelligence. Such combination is achieved by building communication among LLMs. While current systems primarily facilitate such communication through natural language, this paper proposes a novel paradigm of direct dense vector communication between LLMs. Our approach eliminates the unnecessary embedding and de-embedding steps when one LLM interacts with another, enabling more efficient information transfer, fully differentiable optimization pathways, and exploration of capabilities beyond human heuristics. We use such stripped LLMs as vertices and optimizable seq2seq modules as edges to construct LMNet, which has a structure similar to that of MLPs. By utilizing smaller pre-trained LLMs as vertices, we train an LMNet that achieves performance comparable to LLMs of similar size at less than 0.1% of the training cost. This offers a new perspective on scaling for general intelligence rather than training a monolithic LLM from scratch. In addition, the proposed method can be used for other applications, such as customizing LLMs with limited data, showing its versatility.

摘要

随着模块化组件与低层次智能的结合催生出更高层次的智能，许多研究通过整合大语言模型（LLMs）来实现集体智能。这种整合通常建立在LLMs之间的通信机制上。当前系统主要依赖自然语言实现通信，本文提出了一种LLMs间直接稠密向量通信的新范式。该方法消除了LLMs交互时不必要的嵌入与解嵌步骤，能实现更高效的信息传递、完全可微的优化路径，并探索超越人类启发式的能力边界。我们将这类精简LLMs作为顶点、可优化的序列到序列模块作为边，构建了结构类似多层感知机的LMNet。通过采用较小预训练LLMs作为顶点，我们训练的LMNet仅用不到0.1%的训练成本即达到同规模LLMs相当的性能。这为通用智能的规模化提供了新思路，而非从零开始训练单一巨型LLM。此外，该方法还可应用于有限数据定制LLM等场景，展现了其多功能性。

HydraInfer: Hybrid Disaggregated Scheduling for Multimodal Large Language Model Serving

Abstract

arXiv:2505.12658v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have been rapidly advancing, enabling cross-modal understanding and generation, and propelling artificial intelligence towards artificial general intelligence. However, existing MLLM inference systems are typically designed based on the architecture of language models, integrating image processing and language processing as a single scheduling unit. This design struggles to accommodate the heterogeneous demands of different stages in terms of computational resources, memory access patterns, and service-level objectives (SLOs), leading to low resource utilization and high request latency, ultimately failing to meet the service requirements of diverse inference scenarios. To address these challenges, we propose HydraInfer, an efficient MLLM inference system that adopts a Hybrid Encode-Prefill-Decode (EPD) Disaggregation architecture. By scheduling the three stages - encode, prefill, and decode - onto separate heterogeneous inference instances, the system flexibly reallocates resources across stages, significantly reducing idle computation, alleviating resource bottlenecks, and improving overall system throughput and scalability. In addition, HydraInfer supports a stage-level batching strategy that enhances load balancing, enables parallel execution of visual and language models, and further optimizes inference performance. Experiments under real multimodal inference workloads demonstrate that HydraInfer can achieve up to 4x higher inference throughput compared to state-of-the-art systems (e.g., vLLM) on a single-node 8xH800 GPU cluster, while meeting the 90th percentile request SLO.

摘要

多模态大语言模型（MLLMs）的快速发展实现了跨模态理解与生成能力，推动人工智能向通用人工智能迈进。然而，现有MLLM推理系统通常基于语言模型架构设计，将图像处理与语言处理整合为单一调度单元。这种设计难以适应不同阶段在计算资源、内存访问模式和服务级别目标（SLOs）方面的异构需求，导致资源利用率低下、请求延迟高企，最终无法满足多样化推理场景的服务需求。为应对这些挑战，我们提出HydraInfer——一种采用混合编码-预填充-解码（EPD）解耦架构的高效MLLM推理系统。通过将编码、预填充和解码三个阶段调度至独立的异构推理实例，该系统能够灵活跨阶段重新分配资源，显著减少计算闲置、缓解资源瓶颈，从而提升整体系统吞吐量与可扩展性。此外，HydraInfer支持阶段级批处理策略，通过增强负载均衡、实现视觉与语言模型并行执行来进一步优化推理性能。真实多模态推理工作负载下的实验表明，在单节点8×H800 GPU集群上，HydraInfer相比前沿系统（如vLLM）可实现高达4倍的推理吞吐量提升，同时满足90百分位请求SLO要求。
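
As a rough illustration of stage-level disaggregation, the sketch below moves requests through separate encode, prefill, and decode queues, each with its own batch size. It is a single-process toy, not the system's scheduler: the queue layout, the rule that text-only requests skip the encode stage, the one-step decode, and all names are assumptions made for illustration.

```python
from collections import deque

STAGES = ("encode", "prefill", "decode")

def run_pipeline(requests, batch_size):
    """Advance requests through per-stage queues, batching each stage separately."""
    queues = {stage: deque() for stage in STAGES}
    for req in requests:
        # Assumption: requests without an image enter directly at prefill.
        queues["encode" if req["has_image"] else "prefill"].append(req["id"])

    finished, log = [], []
    while any(queues.values()):
        # Visit later stages first so a request advances one stage per tick.
        for i in reversed(range(len(STAGES))):
            stage = STAGES[i]
            queue = queues[stage]
            batch = [queue.popleft() for _ in range(min(batch_size[stage], len(queue)))]
            if not batch:
                continue
            log.append((stage, batch))
            if i + 1 < len(STAGES):
                queues[STAGES[i + 1]].extend(batch)
            else:
                finished.extend(batch)
    return finished, log

requests = [
    {"id": "a", "has_image": True},
    {"id": "b", "has_image": False},
    {"id": "c", "has_image": True},
]
finished, log = run_pipeline(requests, {"encode": 2, "prefill": 2, "decode": 2})
```

Because each stage has its own queue and batch size, the text-only request `b` is not held back by image encoding for `a` and `c`, which is the kind of flexibility the abstract attributes to separating the stages.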

Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities

Abstract

arXiv:2505.12680v1 Announce Type: new Abstract: LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery. But beyond syntactic correctness, do these systems truly understand mathematical structure as humans do? We investigate this question through the lens of mathematical inequalities -- a fundamental tool across many domains. While modern provers can solve basic inequalities, we probe their ability to handle human-intuitive compositionality. We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers -- including Goedel, STP, and Kimina-7B -- struggle significantly. DeepSeek-Prover-V2-7B shows relative robustness -- possibly because it is trained to decompose the problems into sub-problems -- but still suffers a 20% performance drop (pass@32). Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of weakness is indeed in compositional reasoning. Our results expose a persisting gap between the generalization behavior of current AI provers and human mathematical intuition.

摘要

基于大型语言模型（LLM）的形式化证明辅助工具（如Lean中的实现）在数学发现自动化方面展现出巨大潜力。但除了语法正确性外，这些系统是否真正像人类一样理解数学结构？我们通过数学不等式这一多领域基础工具来探究该问题。尽管现代证明器能解决基本不等式，我们重点考察其处理人类直觉组合性的能力。我们提出Ineq-Comp基准测试集，通过对初等不等式进行变量复制、代数重写和多步组合等系统变换构建而成。虽然这些问题对人类仍属简单，但发现包括Goedel、STP和Kimina-7B在内的大多数证明器表现显著受限。DeepSeek-Prover-V2-7B展现出相对稳健性——可能得益于其将问题分解为子问题的训练方式——但在32次尝试通过率（pass@32）上仍存在20%的性能下降。值得注意的是，即使提供组成部分的形式化证明上下文，所有模型性能依然低下，揭示其弱点确实存在于组合推理环节。我们的研究结果暴露出当前AI证明器的泛化行为与人类数学直觉之间持续存在的差距。
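
To give a sense of what a compositional transformation of an elementary inequality might look like, here is a hypothetical Lean 4 example written against Mathlib. The seed statement and its duplicated-variable composition are illustrative assumptions, not problems taken from the benchmark, and the theorem names are made up.

```lean
import Mathlib

-- Seed inequality: a two-variable AM-GM style bound.
theorem ineq_seed (x y : ℝ) : x ^ 2 + y ^ 2 ≥ 2 * x * y := by
  nlinarith [sq_nonneg (x - y)]

-- Composed variant: the seed is duplicated on fresh variables and the two
-- copies are added. A human sees immediately that it follows from the seed.
theorem ineq_composed (x y z w : ℝ) :
    (x ^ 2 + y ^ 2) + (z ^ 2 + w ^ 2) ≥ 2 * x * y + 2 * z * w := by
  linarith [ineq_seed x y, ineq_seed z w]
```

The second proof only reuses the first twice, which is the kind of step the abstract describes as easy for humans yet difficult for most provers.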

Correspondence of high-dimensional emotion structures elicited by video clips between humans and Multimodal LLMs

Abstract

arXiv:2505.12746v1 Announce Type: new Abstract: Recent studies have revealed that human emotions exhibit a high-dimensional, complex structure. Fully capturing this complexity requires new approaches, as conventional models that disregard high dimensionality risk overlooking key nuances of human emotions. Here, we examined the extent to which the latest generation of rapidly evolving Multimodal Large Language Models (MLLMs) capture these high-dimensional, intricate emotion structures, including capabilities and limitations. Specifically, we compared self-reported emotion ratings from participants watching videos with model-generated estimates (e.g., Gemini or GPT). We evaluated performance not only at the individual video level but also at the level of emotion structures that account for inter-video relationships. At the level of simple correlation between emotion structures, our results demonstrated strong similarity between human and model-inferred emotion structures. To further explore whether the similarity between humans and models holds at the single-item level or the coarse-categorical level, we applied Gromov-Wasserstein Optimal Transport. We found that although performance was not necessarily high at the strict, single-item level, performance across video categories that elicit similar emotions was substantial, indicating that the model could infer human emotional experiences at the category level. Our results suggest that current state-of-the-art MLLMs broadly capture the complex high-dimensional emotion structures at the category level, but show apparent limitations in accurately capturing entire structures at the single-item level.

摘要

最新研究表明，人类情绪呈现出高维度、复杂化的结构特征。要完整捕捉这种复杂性需要采用新方法，因为忽视高维度的传统模型可能会遗漏人类情绪的关键细微差异。本研究探讨了快速发展的多模态大语言模型（MLLMs）在捕捉这些高维度复杂情绪结构方面的能力与局限。我们通过对比受试者观看视频时的自评情绪数据与模型（如Gemini或GPT）生成的情绪估值，不仅在单个视频层面，更从考虑视频间关联的情绪结构维度进行评估。在情绪结构简单相关性层面，结果显示人类情绪结构与模型推断结构具有高度相似性。为深入探究这种相似性存在于单项层面还是粗粒度类别层面，我们应用了Gromov Wasserstein最优传输方法。研究发现，虽然模型在严格的单项层面表现未必出色，但在诱发相似情绪的视频类别间表现出显著推断能力，表明模型可在类别层面理解人类情感体验。结果表明，当前最先进的多模态大语言模型能够在类别层面大体把握复杂的高维情绪结构，同时在单项层面准确捕捉完整结构仍存在明显局限。
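
The "simple correlation between emotion structures" step can be sketched as a comparison of pairwise dissimilarities between videos. The sketch assumes each video is represented by a vector of emotion ratings, uses Euclidean distance between videos, and compares the human and model structures with a Pearson correlation. The distance metric, the correlation measure, and the toy ratings are assumptions; the Gromov-Wasserstein analysis is not covered here.

```python
import math
from itertools import combinations

def pairwise_distances(ratings):
    """Distances between every pair of videos' emotion rating vectors."""
    return [math.dist(a, b) for a, b in combinations(ratings, 2)]

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def structure_similarity(human, model):
    """Correlate the inter-video dissimilarity structures of two raters."""
    return pearson(pairwise_distances(human), pairwise_distances(model))

# Toy ratings for three videos on two emotions (hypothetical values).
human = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
model = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # same structure, different scale

similarity = structure_similarity(human, model)
```

Because only the relationships between videos are compared, a model that rates every emotion on a different scale can still match the human structure closely.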

IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment

Abstract

arXiv:2505.12762v1 Announce Type: new Abstract: Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets. When training on multiple capabilities simultaneously, the mixture training dataset, governed by volumes of data from different domains, is a critical factor that directly impacts the final model's performance. Unlike many studies that focus on enhancing the quality of training datasets through data selection methods, few works explore the intricate relationship between the compositional quantity of mixture training datasets and the emergent capabilities of LLMs. Given the availability of a high-quality multi-domain training dataset, understanding the impact of data from each domain on the model's overall capabilities is crucial for preparing SFT data and training a well-balanced model that performs effectively across diverse domains. In this work, we introduce IDEAL, an innovative data equilibrium adaptation framework designed to effectively optimize volumes of data from different domains within mixture SFT datasets, thereby enhancing the model's alignment and performance across multiple capabilities. IDEAL employs a gradient-based approach to iteratively refine the training data distribution, dynamically adjusting the volumes of domain-specific data based on their impact on downstream task performance. By leveraging this adaptive mechanism, IDEAL ensures a balanced dataset composition, enabling the model to achieve robust generalization and consistent proficiency across diverse tasks. Experiments across different capabilities demonstrate that IDEAL outperforms conventional uniform data allocation strategies, achieving a comprehensive improvement of approximately 7% in multi-task evaluation scores.

摘要

大型语言模型（LLMs）通过在多样化指令数据集上的监督微调（SFT）实现了卓越性能。当同时训练多领域能力时，由不同领域数据量主导的混合训练数据集是直接影响模型最终表现的关键因素。与许多通过数据选择方法提升训练集质量的研究不同，目前极少有工作探索混合训练数据集的数量构成与LLMs涌现能力之间的复杂关系。在拥有高质量多领域训练数据集的前提下，理解各领域数据对模型整体能力的影响，对于准备SFT数据和训练跨领域表现均衡的模型至关重要。本研究提出IDEAL——一种创新的数据均衡适配框架，旨在有效优化混合SFT数据集中不同领域的数据量，从而增强模型在多领域任务中的对齐能力和性能。IDEAL采用基于梯度的方法迭代优化训练数据分布，根据下游任务表现动态调整各领域数据量。通过这种自适应机制，IDEAL确保数据集构成的平衡性，使模型能够实现稳健的泛化能力与跨任务的一致性能。多领域能力实验表明，IDEAL优于传统的均匀数据分配策略，在多任务评估分数上实现约7%的综合提升。

Emergent Specialization: Rare Token Neurons in Language Models

Abstract

arXiv:2505.12822v1 Announce Type: new Abstract: Large language models struggle with representing and generating rare tokens despite their importance in specialized domains. In this study, we identify neuron structures with exceptionally strong influence on a language model's prediction of rare tokens, termed rare token neurons, and investigate the mechanism for their emergence and behavior. These neurons exhibit a characteristic three-phase organization (plateau, power-law, and rapid decay) that emerges dynamically during training, evolving from a homogeneous initial state to a functionally differentiated architecture. In the activation space, rare token neurons form a coordinated subnetwork that selectively co-activates while avoiding co-activation with other neurons. This functional specialization potentially correlates with the development of heavy-tailed weight distributions, suggesting a statistical mechanical basis for emergent specialization.

摘要

尽管罕见词在专业领域中具有重要性，大语言模型仍难以有效表征和生成这类词汇。本研究识别出对语言模型预测罕见词具有异常强烈影响的神经元结构（称为罕见词神经元），并探究其形成机制与行为特征。这些神经元展现出动态训练过程中形成的特征性三阶段组织模式（平台期、幂律期和快速衰减期），从同质初始状态演化为功能分化的结构。在激活空间中，罕见词神经元构成协调的子网络，选择性共激活的同时避免与其他神经元共激活。这种功能特化可能与权重分布重尾化的发展相关，暗示了涌现特化现象的统计力学基础。

Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation

Abstract

arXiv:2505.12744v1 Announce Type: new Abstract: Recent Large Multimodal Models have demonstrated remarkable reasoning capabilities, especially in solving complex mathematical problems and realizing accurate spatial perception. Our key insight is that these emerging abilities can naturally extend to robotic manipulation by enabling LMMs to directly infer the next goal in language via reasoning, rather than relying on a separate action head. However, this paradigm meets two main challenges: i) How to make LMMs understand the spatial action space, and ii) How to fully exploit the reasoning capacity of LMMs in solving these tasks. To tackle the former challenge, we propose a novel task formulation, which inputs the current states of object parts and the gripper, and reformulates rotation by a new axis representation instead of traditional Euler angles. This representation is more compatible with spatial reasoning and easier to interpret within a unified language space. For the latter challenge, we design a pipeline to utilize cutting-edge LMMs to generate a small but high-quality reasoning dataset of multi-round dialogues that successfully solve manipulation tasks for supervised fine-tuning. Then, we perform reinforcement learning by trial-and-error interactions in simulation to further enhance the model's reasoning abilities for robotic manipulation. Our resulting reasoning model built upon a 7B backbone, named ReasonManip, demonstrates three notable advantages driven by its system-2 level reasoning capabilities: i) exceptional generalizability to out-of-distribution environments, objects, and tasks; ii) inherent sim-to-real transfer ability enabled by the unified language representation shared across domains; iii) transparent interpretability connecting high-level reasoning and low-level control. Extensive experiments demonstrate the effectiveness of the proposed paradigm and its potential to advance LMM-driven robotic manipulation.

摘要

近期的大型多模态模型展现出卓越的推理能力，尤其在解决复杂数学问题和实现精准空间感知方面。我们的核心发现是，这些新兴能力可自然延伸至机器人操控领域——通过让多模态模型直接通过推理以语言形式推断下一目标，而非依赖独立的动作输出模块。然而，该范式面临两大挑战：i)如何使多模态模型理解空间动作表征；ii)如何充分释放其推理能力以解决此类任务。针对前者，我们提出新颖的任务形式化方法：输入物体部件与夹爪的当前状态，并采用新型轴向表征替代传统欧拉角描述旋转。该表征更契合空间推理需求，且更易在统一语言空间内解析。对于后者，我们设计流程利用前沿多模态模型生成小规模高质量推理数据集（包含成功解决操控任务的多轮对话），用于监督微调。随后通过仿真环境中的试错交互进行强化学习，进一步提升模型在机器人操控中的推理能力。基于7B参数主干构建的推理模型ReasonManip展现出系统二级推理能力驱动的三大优势：i)对分布外环境、物体及任务的卓越泛化性；ii)跨领域共享统一语言表征带来的先天仿真到现实迁移能力；iii)连接高层推理与底层控制的透明可解释性。大量实验验证了所提范式的有效性及其推动多模态模型驱动机器人操控发展的潜力。

FRAbench and GenEval: Scaling Fine-Grained Aspect Evaluation across Tasks, Modalities

Abstract

arXiv:2505.12795v1 Announce Type: new Abstract: Evaluating the open-ended outputs of large language models (LLMs) has become a bottleneck as model capabilities, task diversity, and modality coverage rapidly expand. Existing "LLM-as-a-Judge" evaluators are typically limited to a few tasks, aspects, or modalities, and often suffer from low consistency. In this paper, we argue that explicit, fine-grained aspect specification is the key to both generalizability and objectivity in automated evaluation. To this end, we introduce a hierarchical aspect taxonomy spanning 112 aspects that unifies evaluation across four representative settings: Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. Building on this taxonomy, we create FRAbench, a benchmark comprising 60.4k pairwise samples with 325k aspect-level labels obtained from a combination of human and LLM annotations. FRAbench provides the first large-scale, multi-modal resource for training and meta-evaluating fine-grained LMM judges. Leveraging FRAbench, we develop GenEval, a fine-grained evaluator generalizable across tasks and modalities. Experiments show that GenEval (i) attains high agreement with GPT-4o and expert annotators, (ii) transfers robustly to unseen tasks and modalities, and (iii) reveals systematic weaknesses of current LMMs on evaluation.

摘要

随着大语言模型(LLM)能力的快速提升、任务多样性的增加以及多模态覆盖范围的扩展，对其开放端输出的评估已成为瓶颈。现有'以LLM为评判者'的评估方法通常局限于少数任务、维度或模态，且容易存在一致性不足的问题。本文提出，明确且细粒度的维度规范是实现自动化评估通用性和客观性的关键。为此，我们构建了一个包含112个维度的层次化分类体系，统一了自然语言生成、图像理解、图像生成以及图文交错生成四种典型场景的评估标准。基于该体系，我们创建了FRAbench基准测试，包含6.04万对样本和32.5万个维度级标签，数据来源于人工标注与LLM标注的结合。FRAbench是首个用于训练和元评估细粒度LMM评判者的大规模多模态资源。依托FRAbench，我们开发了通用跨任务和跨模态的细粒度评估器GenEval。实验表明：GenEval（1）与GPT-4o及专家标注者保持高度一致性；（2）能稳健迁移至未见任务和模态；（3）揭示了当前LMM在评估方面的系统性缺陷。

A Study on Distributed Strategies for Deep Learning Applications in GPU Clusters

Abstract

arXiv:2505.12832v1 Announce Type: new Abstract: As deep learning models grow in size and complexity, training them efficiently on single GPUs becomes increasingly infeasible. This study investigates the effectiveness of several distributed training strategies, namely Distributed Data Parallel (DDP), Fully Sharded Data Parallelism (FSDP), and Parameter Server (PS) models, for scalable deep learning on GPU clusters. We conduct empirical evaluations across multiple models and datasets to assess trade-offs in memory usage, training time, GPU utilization, and model accuracy. Our results show that while FSDP reduces GPU memory usage by over 60%, it increases training time by up to 6x compared to DDP. In contrast, asynchronous PS training improves throughput but can lead to degraded accuracy due to stale updates. Through comprehensive analysis, we provide practical insights into the strengths and limitations of each strategy, offering guidance for selecting suitable methods based on system constraints and training objectives.

摘要

随着深度学习模型规模和复杂度的增长，在单GPU上高效训练变得越来越不可行。本研究探讨了分布式数据并行（DDP）、全分片数据并行（FSDP）和参数服务器（PS）三种分布式训练策略在GPU集群上实现可扩展深度学习的有效性。我们通过多模型和多数据集的实证评估，对比分析了内存占用、训练时间、GPU利用率和模型精度等方面的权衡。结果表明：FSDP虽然能降低60%以上的GPU内存使用，但与DDP相比训练时间最多增加6倍；而异步PS训练虽能提升吞吐量，却可能因更新延迟导致精度下降。通过全面分析，我们揭示了各策略的优势与局限，为根据系统约束和训练目标选择合适方法提供了实践指导。
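
The memory side of the DDP versus FSDP trade-off described above can be illustrated with a back-of-the-envelope estimate. This is only a sketch under stated assumptions, not the study's measurement method: the helper name `per_gpu_state_bytes` and the 16 bytes-per-parameter figure are hypothetical, and the estimate covers model states only (parameters, gradients, optimizer state), ignoring activations and communication buffers. It is therefore not expected to reproduce the roughly 60% reduction reported in the abstract.

```python
def per_gpu_state_bytes(n_params, bytes_per_param, n_gpus, sharded):
    """Estimate per-GPU bytes for model states (parameters, gradients, optimizer state).

    bytes_per_param is an assumed total across all three kinds of state;
    activations and communication buffers are deliberately ignored.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    total = n_params * bytes_per_param
    # DDP keeps a full replica of the model states on every GPU,
    # while FSDP shards them across the data-parallel workers.
    return total // n_gpus if sharded else total


# Illustrative numbers only: a 1B-parameter model on 8 GPUs.
ddp = per_gpu_state_bytes(1_000_000_000, 16, 8, sharded=False)
fsdp = per_gpu_state_bytes(1_000_000_000, 16, 8, sharded=True)
print(f"DDP: {ddp / 1e9:.0f} GB, FSDP: {fsdp / 1e9:.0f} GB of model states per GPU")
```

The sharded figure shrinks with the number of GPUs, but each forward and backward pass then has to gather the shards, which is consistent with the longer training times the abstract reports for FSDP.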

Reasoning BO: Enhancing Bayesian Optimization with Long-Context Reasoning Power of LLMs

Abstract

arXiv:2505.12833v1 Announce Type: new Abstract: Many real-world scientific and industrial applications require the optimization of expensive black-box functions. Bayesian Optimization (BO) provides an effective framework for such problems. However, traditional BO methods are prone to getting trapped in local optima and often lack interpretable insights. To address this issue, this paper designs Reasoning BO, a novel framework that leverages reasoning models to guide the sampling process in BO while incorporating multi-agent systems and knowledge graphs for online knowledge accumulation. By integrating the reasoning and contextual understanding capabilities of Large Language Models (LLMs), we can provide strong guidance to enhance the BO process. As the optimization progresses, Reasoning BO provides real-time sampling recommendations along with critical insights grounded in plausible scientific theories, aiding in the discovery of superior solutions within the search space. We systematically evaluate our approach across 10 diverse tasks encompassing synthetic mathematical functions and complex real-world applications. The framework demonstrates its capability to progressively refine sampling strategies through real-time insights and hypothesis evolution, effectively identifying higher-performing regions of the search space for focused exploration. This process highlights the powerful reasoning and context-learning abilities of LLMs in optimization scenarios. For example, in the Direct Arylation task, our method increased the yield to 60.7%, whereas traditional BO achieved only a 25.2% yield. Furthermore, our investigation reveals that smaller LLMs, when fine-tuned through reinforcement learning, can attain comparable performance to their larger counterparts. This enhanced reasoning capability paves the way for more efficient automated scientific experimentation while maintaining computational feasibility.

摘要

许多现实世界的科学和工业应用需要对昂贵的黑盒函数进行优化。贝叶斯优化(BO)为此类问题提供了有效框架。然而传统BO方法容易陷入局部最优且往往缺乏可解释性见解。针对这一问题，本文设计了推理BO这一新颖框架，该框架利用推理模型指导BO采样过程，同时结合多智能体系统和知识图谱实现在线知识积累。通过整合大语言模型(LLMs)的推理与上下文理解能力，我们能为BO过程提供强有力的指导。随着优化进程推进，推理BO能基于合理科学理论提供实时采样建议与关键见解，帮助在搜索空间中发现更优解。我们在包含合成数学函数和复杂实际应用的10项不同任务中系统评估了该方法。该框架展现出通过实时洞察与假设演化逐步完善采样策略的能力，能有效识别搜索空间中更高性能区域进行聚焦探索。这一过程凸显了LLMs在优化场景中强大的推理与情境学习能力。例如在直接芳基化反应任务中，我们的方法将产率提升至60.7%，而传统BO仅获得25.2%产率。此外研究发现，经过强化学习微调的小型LLMs可获得与大型模型相当的性能。这种增强的推理能力在保持计算可行性的同时，为更高效的自动化科学实验开辟了道路。

Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks

Abstract

arXiv:2505.12845v1 Announce Type: new Abstract: RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences, demonstrating exceptional and measurable efficacy in instruction following tasks; however, it exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks. Conventional approaches rely heavily on human annotation or more sophisticated large language models, thereby introducing substantial resource expenditure or potential bias concerns. Meanwhile, alternative synthetic methods that augment standard preference datasets often compromise the model's semantic quality. Our research identifies a critical oversight in existing techniques, which predominantly focus on comparing responses while neglecting valuable latent signals embedded within prompt inputs, and which only focus on preference disparities at the intra-sample level, while neglecting to account for the inter-sample level preference differentials that exist among preference data. To leverage these previously neglected indicators, we propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities. Specifically, for any given response in original preference data pairs, we construct varied prompts with a preference relation under different conditions, in order to learn intra-sample level preference disparities. Furthermore, for any given original preference pair, we synthesize multi-instruction preference pairs to capture preference discrepancies at the inter-sample level. Building on the two datasets constructed above, we consequently devise two sophisticated training objective functions. Subsequently, our framework integrates seamlessly into both Reward Modeling and Direct Preference Optimization paradigms. Through rigorous evaluation across multiple benchmarks, we empirically validate the efficacy of our framework.

摘要

基于人类反馈的强化学习（RLHF）已成为人工智能系统与人类偏好对齐的主流方法，在指令跟随任务中展现出卓越且可衡量的效能；然而在面对复杂多指令任务时，其合规能力表现不足。传统方法严重依赖人工标注或更复杂的大型语言模型，从而导致大量资源消耗或潜在偏差问题。与此同时，其他通过增强标准偏好数据集的合成方法往往损害模型的语义质量。我们的研究发现现有技术存在关键性疏忽：主要聚焦于响应比较而忽视了提示输入中蕴含的宝贵潜在信号，且仅关注样本内层面的偏好差异，而未能考量偏好数据间存在的样本间层面偏好差异。为利用这些被忽视的指标，我们提出新型多层级感知偏好学习（MAPL）框架以增强多指令处理能力。具体而言，针对原始偏好数据对中的任一响应，我们构建不同条件下具有偏好关系的多样化提示，以学习样本内层面的偏好差异。此外，针对任一原始偏好对，我们合成多指令偏好对以捕捉样本间层面的偏好差异。基于上述构建的两个数据集，我们进而设计出两个精密的训练目标函数。随后，该框架可无缝集成至奖励建模和直接偏好优化两种范式。通过在多个基准测试中的严格评估，我们实证验证了本框架的有效性。

Detection and Mitigation of Hallucination in Large Reasoning Models: A Mechanistic Perspective

Abstract

arXiv:2505.12886v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have shown impressive capabilities in multi-step reasoning tasks. However, alongside these successes, a more deceptive form of model error has emerged--Reasoning Hallucination--where logically coherent but factually incorrect reasoning traces lead to persuasive yet faulty conclusions. Unlike traditional hallucinations, these errors are embedded within structured reasoning, making them more difficult to detect and potentially more harmful. In this work, we investigate reasoning hallucinations from a mechanistic perspective. We propose the Reasoning Score, which quantifies the depth of reasoning by measuring the divergence between logits obtained from projecting late layers of LRMs to the vocabulary space, effectively distinguishing shallow pattern-matching from genuine deep reasoning. Using this score, we conduct an in-depth analysis on the ReTruthQA dataset and identify two key reasoning hallucination patterns: early-stage fluctuation in reasoning depth and incorrect backtracking to flawed prior steps. These insights motivate our Reasoning Hallucination Detection (RHD) framework, which achieves state-of-the-art performance across multiple domains. To mitigate reasoning hallucinations, we further introduce GRPO-R, an enhanced reinforcement learning algorithm that incorporates step-level deep reasoning rewards via potential-based shaping. Our theoretical analysis establishes stronger generalization guarantees, and experiments demonstrate improved reasoning quality and reduced hallucination rates.

摘要

大型推理模型（LRMs）在多步推理任务中展现出卓越能力。然而伴随这些成功，一种更具欺骗性的模型错误形式——'推理幻觉'逐渐显现：即逻辑连贯但事实错误的推理轨迹导致具有说服力却存在缺陷的结论。与传统幻觉不同，这类错误嵌入结构化推理过程中，使其更难以检测且潜在危害更大。本研究从机制角度探究推理幻觉现象，提出'推理分数'量化推理深度——通过测量模型深层投影至词汇空间的logits差异，有效区分浅层模式匹配与真正深度推理。基于该指标，我们在ReTruthQA数据集上开展深入分析，发现两种关键推理幻觉模式：早期推理深度波动及错误回溯至缺陷步骤。这些发现催生出推理幻觉检测框架（RHD），其在多领域实现最先进性能。为缓解推理幻觉，我们进一步提出GRPO-R强化学习算法，通过基于势函数的步骤级深度推理奖励进行增强。理论分析证明了更强的泛化保证，实验表明该方法能提升推理质量并降低幻觉率。
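
The abstract above says GRPO-R adds step-level rewards "via potential-based shaping". As a hedged illustration of that general technique only (the classic form F = γ·Φ(s') − Φ(s), not the paper's actual reward design), the sketch below adds a shaping term to each step reward. The function names and the example potentials are hypothetical; in the paper's setting the potential would presumably come from a reasoning-depth signal, which is not modelled here.

```python
def shaped_rewards(rewards, potentials, gamma=0.99):
    """Add potential-based shaping F = gamma * phi(s') - phi(s) to each step reward.

    rewards[t] is the reward for the transition s_t -> s_{t+1};
    potentials holds phi(s_0) ... phi(s_T), one more entry than rewards.
    """
    if len(potentials) != len(rewards) + 1:
        raise ValueError("need one potential per state, i.e. len(rewards) + 1")
    return [
        r + gamma * potentials[t + 1] - potentials[t]
        for t, r in enumerate(rewards)
    ]


def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical three-step trajectory with made-up potentials.
rewards = [0.0, 0.0, 1.0]
potentials = [0.2, 0.5, 0.9, 0.0]
shaped = shaped_rewards(rewards, potentials, gamma=0.9)
```

The shaping terms telescope, so the shaped return differs from the original return only by γ^T·Φ(s_T) − Φ(s_0), a quantity that does not depend on the actions taken. That is why this form of shaping can densify step-level feedback without changing which policy is optimal.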

TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Abstract

arXiv:2505.12891v1 Announce Type: new Abstract: Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning models and non-reasoning models, provide an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME , and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME .

摘要

时间推理对于大语言模型(LLMs)理解现实世界至关重要。然而现有研究忽视了时间推理面临的现实挑战：(1)密集的时间信息，(2)快速变化的事件动态，以及(3)社交互动中复杂的时间依赖关系。为弥补这一空白，我们提出了面向现实场景的多层次时间推理基准TIME。该基准包含38,522个问答对，涵盖3个层级11个细粒度子任务，由反映不同现实挑战的3个子数据集组成：TIME-Wiki、TIME-News和TIME-Dial。我们对推理模型和非推理模型进行了广泛实验，深入分析了不同现实场景和任务中的时间推理表现，并总结了测试时扩展对时间推理能力的影响。此外，我们发布了人工标注的子集TIME-Lite以促进未来时间推理研究和标准化评估。代码发布于https://github.com/sylvain-wei/TIME，数据集发布于https://huggingface.co/datasets/SylvainWei/TIME。

LLM-KG-Bench 3.0: A Compass for Semantic Technology Capabilities in the Ocean of LLMs

Abstract

arXiv:2505.13098v1 Announce Type: new Abstract: Current Large Language Models (LLMs) can assist in developing program code, among many other things, but can they support working with Knowledge Graphs (KGs) as well? Which LLM offers the best capabilities in the field of Semantic Web and Knowledge Graph Engineering (KGE)? Is it possible to determine this without checking many answers manually? The LLM-KG-Bench framework in Version 3.0 is designed to answer these questions. It consists of an extensible set of tasks for automated evaluation of LLM answers and covers different aspects of working with semantic technologies. In this paper the LLM-KG-Bench framework is presented in Version 3.0 along with a dataset of prompts, answers and evaluations generated with it and several state-of-the-art LLMs. Significant enhancements have been made to the framework since its initial release, including an updated task API that offers greater flexibility in handling evaluation tasks, revised tasks, and extended support for various open models through the vllm library, among other improvements. A comprehensive dataset has been generated using more than 30 contemporary open and proprietary LLMs, enabling the creation of exemplary model cards that demonstrate the models' capabilities in working with RDF and SPARQL, as well as comparing their performance on Turtle and JSON-LD RDF serialization tasks.

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

Abstract

arXiv:2505.13031v1 Announce Type: new Abstract: Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks. We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning. MindOmni leverages a three-phase training strategy: i) design of a unified vision language model with a decoder-only diffusion module, ii) supervised fine-tuning with Chain-of-Thought (CoT) instruction data, and iii) our proposed Reasoning Generation Policy Optimization (RGPO) algorithm, utilizing multimodal feedback to effectively guide policy updates. Experimental results demonstrate that MindOmni outperforms existing models, achieving impressive performance on both understanding and generation benchmarks, while showcasing advanced fine-grained reasoning generation capabilities, especially with mathematical reasoning instruction. All code will be made public at https://github.com/EasonXiao-888/MindOmni .

摘要

当前文本到图像系统在处理多模态输入和复杂推理任务时存在局限性。我们提出MindOmni，一个统一的多模态大语言模型，通过强化学习驱动的推理生成机制解决这些挑战。该模型采用三阶段训练策略：i) 设计具有仅解码器扩散模块的统一视觉语言模型；ii) 使用思维链(CoT)指令数据进行监督微调；iii) 提出推理生成策略优化(RGPO)算法，利用多模态反馈有效指导策略更新。实验结果表明，MindOmni在理解和生成基准测试中均超越现有模型，同时展现出先进的细粒度推理生成能力，特别是在数学推理指令方面表现突出。所有代码将公开于 https://github.com/EasonXiao-888/MindOmni 。

Language Models That Walk the Talk: A Framework for Formal Fairness Certificates

Abstract

arXiv:2505.12767v1 Announce Type: new Abstract: As large language models become integral to high-stakes applications, ensuring their robustness and fairness is critical. Despite their success, large language models remain vulnerable to adversarial attacks, where small perturbations, such as synonym substitutions, can alter model predictions, posing risks in fairness-critical areas, such as gender bias mitigation, and safety-critical areas, such as toxicity detection. While formal verification has been explored for neural networks, its application to large language models remains limited. This work presents a holistic verification framework to certify the robustness of transformer-based language models, with a focus on ensuring gender fairness and consistent outputs across different gender-related terms. Furthermore, we extend this methodology to toxicity detection, offering formal guarantees that adversarially manipulated toxic inputs are consistently detected and appropriately censored, thereby ensuring the reliability of moderation systems. By formalizing robustness within the embedding space, this work strengthens the reliability of language models in ethical AI deployment and content moderation.

摘要

随着大语言模型在高风险应用中的普及，确保其鲁棒性与公平性变得至关重要。尽管取得了显著成效，大语言模型仍易受对抗性攻击的影响——诸如同义词替换等微小扰动即可改变模型预测结果，这在性别偏见缓解等公平性关键领域及毒性检测等安全性关键领域构成潜在风险。虽然形式化验证方法在神经网络中已有探索，但其在大语言模型中的应用仍显不足。本研究提出一个整体验证框架，用于认证基于Transformer的语言模型的鲁棒性，重点保障不同性别相关术语处理中的性别公平性与输出一致性。此外，我们将该方法扩展至毒性检测领域，通过形式化保证确保对抗性篡改的有毒输入能被持续检测并恰当过滤，从而增强内容审核系统的可靠性。通过在嵌入空间形式化定义鲁棒性，本工作为伦理人工智能部署和内容审核中的语言模型可靠性提供了强化方案。
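
To make "formalizing robustness within the embedding space" concrete, here is a toy sketch rather than the paper's verification framework. It certifies a single linear scoring function against every perturbation inside an L-infinity ball around an input embedding. The function names, the linear model, and the numbers are all hypothetical; certifying a real transformer requires propagating such bounds through attention and nonlinear layers, which this sketch does not attempt.

```python
def linear_score_bounds(weights, bias, embedding, eps):
    """Bound w.x + b over all x with max_i |x_i - embedding_i| <= eps.

    For a linear function the worst case moves every coordinate by eps in the
    direction of sign(w_i), so the bounds are center -/+ eps * ||w||_1.
    """
    center = sum(w * x for w, x in zip(weights, embedding)) + bias
    slack = eps * sum(abs(w) for w in weights)
    return center - slack, center + slack


def is_certified(weights, bias, embedding, eps):
    """True if the sign of the score cannot change anywhere inside the ball."""
    lower, upper = linear_score_bounds(weights, bias, embedding, eps)
    return lower > 0 or upper < 0


# Hypothetical 2-dimensional example.
w, b, x = [1.0, -2.0], 0.5, [2.0, 0.5]
print(linear_score_bounds(w, b, x, eps=0.25))  # (0.75, 2.25)
print(is_certified(w, b, x, eps=0.25))         # True
print(is_certified(w, b, x, eps=1.0))          # False
```

If the embeddings of interchangeable terms (for example, gendered synonyms) all fall inside a ball that is certified, the decision is guaranteed to be identical for all of them, which is the kind of guarantee the abstract describes.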

CAIM: Development and Evaluation of a Cognitive AI Memory Framework for Long-Term Interaction with Intelligent Agents

Abstract

arXiv:2505.13044v1 Announce Type: new Abstract: Large language models (LLMs) have advanced the field of artificial intelligence (AI) and are a powerful enabler for interactive systems. However, they still face challenges in long-term interactions that require adaptation towards the user as well as contextual knowledge and understanding of the ever-changing environment. To overcome these challenges, holistic memory modeling is required to efficiently retrieve and store relevant information across interaction sessions for suitable responses. Cognitive AI, which aims to simulate the human thought process in a computerized model, highlights interesting aspects, such as thoughts, memory mechanisms, and decision-making, that can contribute towards improved memory modeling for LLMs. Inspired by these cognitive AI principles, we propose our memory framework CAIM. CAIM consists of three modules: 1.) The Memory Controller as the central decision unit; 2.) the Memory Retrieval, which filters relevant data for interaction upon request; and 3.) the Post-Thinking, which maintains the memory storage. We compare CAIM against existing approaches, focusing on metrics such as retrieval accuracy, response correctness, contextual coherence, and memory storage. The results demonstrate that CAIM outperforms baseline frameworks across different metrics, highlighting its context-awareness and potential to improve long-term human-AI interactions.

摘要

大型语言模型（LLMs）推动了人工智能（AI）领域的发展，成为交互系统的强大赋能工具。然而，在需要适应用户需求、具备情境知识并理解动态环境的长期交互中，这些模型仍面临挑战。为克服这些挑战，需采用整体记忆建模来高效检索和存储跨交互会话的相关信息，以生成恰当响应。认知AI旨在通过计算模型模拟人类思维过程，其强调的思维、记忆机制和决策等关键要素，可为改进LLMs的记忆建模提供启示。基于这些认知AI原理，我们提出记忆框架CAIM。该框架包含三个模块：1）作为核心决策单元的存储器控制器；2）按请求筛选交互相关数据的记忆检索模块；3）负责记忆存储维护的后思考模块。通过检索准确率、响应正确性、上下文连贯性和记忆存储等指标，我们将CAIM与现有方法进行比较。结果表明CAIM在各项指标上均优于基线框架，凸显其情境感知能力和提升人机长期交互的潜力。

Zero-Shot Iterative Formalization and Planning in Partially Observable Environments

Abstract

arXiv:2505.13126v1 Announce Type: new Abstract: In planning, using LLMs not to predict plans but to formalize an environment into the Planning Domain Definition Language (PDDL) has been shown to greatly improve performance and control. While most work has focused on fully observable environments, we tackle the more realistic and challenging partially observable environments, where existing methods are incapacitated by the lack of complete information. We propose PDDLego+, a framework to iteratively formalize, plan, grow, and refine PDDL representations in a zero-shot manner, without needing access to any existing trajectories. On two textual simulated environments, we show that PDDLego+ not only achieves superior performance, but also shows robustness against problem complexity. We also show that the domain knowledge captured after a successful trial is interpretable and benefits future tasks.

摘要

在规划领域，利用大型语言模型（LLM）不直接预测计划，而是将环境形式化为规划域定义语言（PDDL）的方法已被证明能显著提升性能与控制力。现有研究多集中于完全可观测环境，而本文针对更具现实意义和挑战性的部分可观测环境——现有方法因信息缺失而失效的场景。我们提出PDDLego+框架，以零样本方式迭代实现PDDL表示的形式化、规划、扩展与优化，且无需依赖任何现有轨迹数据。在两个文本模拟环境中的实验表明，PDDLego+不仅表现出卓越性能，还对问题复杂度具有强鲁棒性。研究同时证实，成功试验后获取的领域知识具备可解释性，并能迁移至后续任务。

The Traitors: Deception and Trust in Multi-Agent Language Model Simulations

Abstract

arXiv:2505.12923v1 Announce Type: new Abstract: As AI systems increasingly assume roles where trust and alignment with human values are essential, understanding when and why they engage in deception has become a critical research priority. We introduce The Traitors, a multi-agent simulation framework inspired by social deduction games, designed to probe deception, trust formation, and strategic communication among large language model (LLM) agents under asymmetric information. A minority of agents, the traitors, seek to mislead the majority, while the faithful must infer hidden identities through dialogue and reasoning. Our contributions are: (1) we ground the environment in formal frameworks from game theory, behavioral economics, and social cognition; (2) we develop a suite of evaluation metrics capturing deception success, trust dynamics, and collective inference quality; (3) we implement a fully autonomous simulation platform where LLMs reason over persistent memory and evolving social dynamics, with support for heterogeneous agent populations, specialized traits, and adaptive behaviors. Our initial experiments across DeepSeek-V3, GPT-4o-mini, and GPT-4o (10 runs per model) reveal a notable asymmetry: advanced models like GPT-4o demonstrate superior deceptive capabilities yet exhibit disproportionate vulnerability to others' falsehoods. This suggests deception skills may scale faster than detection abilities. Overall, The Traitors provides a focused, configurable testbed for investigating LLM behavior in socially nuanced interactions. We position this work as a contribution toward more rigorous research on deception mechanisms, alignment challenges, and the broader social reliability of AI systems.

摘要

随着人工智能系统日益承担起需要信任和人类价值对齐的关键角色，理解其何时及为何进行欺骗已成为重要研究课题。我们提出《背叛者》——一个受社交推理游戏启发的多智能体仿真框架，旨在探究非对称信息下大语言模型（LLM）智能体的欺骗行为、信任形成与策略性沟通。在该框架中，少数'背叛者'智能体试图误导多数群体，而'忠诚者'必须通过对话推理识别隐藏身份。我们的贡献在于：（1）将环境构建于博弈论、行为经济学和社会认知的形式化框架；（2）开发了衡量欺骗成功率、信任动态和集体推理质量的评估指标体系；（3）实现了完全自主的仿真平台，支持LLM基于持久记忆和动态社交关系进行推理，并可配置异构智能体种群、专属特征及自适应行为。基于DeepSeek-V3、GPT-4o-mini和GPT-4o的初步实验（每个模型10次运行）揭示显著不对称性：GPT-4o等先进模型虽展现卓越欺骗能力，却对他人谎言表现出异常脆弱的识别力，暗示欺骗技能的提升速度可能超越检测能力。该框架为研究LLM在复杂社交互动中的行为提供了可配置的标准化测试环境，我们期望这项工作能推动关于AI欺骗机制、对齐挑战及社会可靠性的更严谨研究。

Agentic Publications: An LLM-Driven Framework for Interactive Scientific Publishing, Supplementing Traditional Papers with AI-Powered Knowledge Systems

Abstract

arXiv:2505.13246v1 Announce Type: new Abstract: The exponential growth of scientific literature presents significant challenges for researchers navigating the complex knowledge landscape. We propose "Agentic Publications", a novel LLM-driven framework complementing traditional publishing by transforming papers into interactive knowledge systems. Our architecture integrates structured data with unstructured content through retrieval-augmented generation and multi-agent verification. The framework offers interfaces for both humans and machines, combining narrative explanations with machine-readable outputs while addressing ethical considerations through automated validation and transparent governance. Key features include continuous knowledge updates, automatic integration of new findings, and customizable detail levels. Our proof-of-concept demonstrates multilingual interaction, API accessibility, and structured knowledge representation through vector databases, knowledge graphs, and verification agents. This approach enhances scientific communication across disciplines, improving efficiency and collaboration while preserving traditional publishing pathways, particularly valuable for interdisciplinary fields where knowledge integration remains challenging.

摘要

科学文献的指数级增长为研究人员在复杂知识领域的探索带来了重大挑战。我们提出"能动性出版物"这一新型大语言模型驱动框架，通过将论文转化为交互式知识系统，对传统出版模式形成补充。该架构通过检索增强生成和多智能体验证技术，将结构化数据与非结构化内容相整合。框架为人类用户和机器系统提供了双重接口，在实现叙述性解释与机器可读输出相结合的同时，通过自动化验证和透明治理机制解决伦理问题。核心特征包括持续知识更新、新发现的自动整合以及可定制的细节层级。概念验证展示了多语言交互、API可访问性，以及通过向量数据库、知识图谱和验证智能体实现的结构化知识表征。该方法在保留传统出版渠道的同时，显著提升了跨学科科学交流效率与合作水平，对于知识整合仍具挑战性的交叉学科领域尤为有益。

Abstract

arXiv:2505.13175v1 Announce Type: new Abstract: The emerging paradigm of leveraging pretrained large language models (LLMs) for time series forecasting has predominantly employed linguistic-temporal modality alignment strategies through token-level or layer-wise feature mapping. However, these approaches fundamentally neglect a critical insight: the core competency of LLMs resides not merely in processing localized token features but in their inherent capacity to model holistic sequence structures. This paper posits that effective cross-modal alignment necessitates structural consistency at the sequence level. We propose the Structure-Guided Cross-Modal Alignment (SGCMA), a framework that fully exploits and aligns the state-transition graph structures shared by time-series and linguistic data as sequential modalities, thereby endowing time series with language-like properties and delivering stronger generalization after modality alignment. SGCMA consists of two key components, namely Structure Alignment and Semantic Alignment. In Structure Alignment, a state transition matrix is learned from text data through Hidden Markov Models (HMMs), and a shallow transformer-based Maximum Entropy Markov Model (MEMM) receives the hot-start transition matrix and annotates each temporal patch into state probability, ensuring that the temporal representation sequence inherits language-like sequential dynamics. In Semantic Alignment, cross-attention is applied between temporal patches and the top-k tokens within each state, and the ultimate temporal embeddings are derived by the expected value of these embeddings using a weighted average based on state probabilities. Experiments on multiple benchmarks demonstrate that SGCMA achieves state-of-the-art performance, offering a novel approach to cross-modal alignment in time series forecasting.

摘要

当前利用预训练大语言模型（LLMs）进行时间序列预测的新兴范式，主要采用通过词元级或分层特征映射的语言-时序模态对齐策略。然而，这些方法从根本上忽视了一个关键洞见：LLMs的核心能力不仅在于处理局部词元特征，更在于其建模整体序列结构的内在能力。本文提出，有效的跨模态对齐需要在序列层面保持结构一致性。我们设计了结构引导的跨模态对齐框架（SGCMA），该框架充分挖掘并对齐时间序列与语言数据作为序列模态所共享的状态转移图结构，从而赋予时间序列类语言特性，并在模态对齐后实现更强的泛化能力。SGCMA包含两个核心组件：结构对齐与语义对齐。在结构对齐中，通过隐马尔可夫模型（HMMs）从文本数据学习状态转移矩阵，随后基于浅层Transformer的最大熵马尔可夫模型（MEMM）接收热启动的转移矩阵，并将每个时间片段标注为状态概率，确保时序表征序列继承类语言的序列动态特性。在语义对齐中，跨注意力机制被应用于时间片段与各状态下top-k词元之间，最终通过基于状态概率的加权平均计算这些嵌入的期望值，得到时序嵌入表示。多基准实验表明，SGCMA实现了最先进的性能，为时间序列预测中的跨模态对齐提供了新范式。
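
The final step of Semantic Alignment described above, taking the expected value of per-state embeddings weighted by state probabilities, amounts to computing e = Σ_s p(s)·e_s for each temporal patch. The sketch below shows only that weighted average. The function name and the toy numbers are hypothetical, and the per-state embeddings (which in the paper come from cross-attention with each state's top-k tokens) are simply given as inputs here.

```python
def expected_embedding(state_probs, state_embeddings):
    """Return sum_s p(s) * e_s for one temporal patch.

    state_probs: probability of each hidden state for this patch (sums to 1).
    state_embeddings: one embedding vector per state, all of equal length.
    """
    if len(state_probs) != len(state_embeddings):
        raise ValueError("need exactly one embedding per state")
    dim = len(state_embeddings[0])
    out = [0.0] * dim
    for p, emb in zip(state_probs, state_embeddings):
        for i, v in enumerate(emb):
            out[i] += p * v
    return out


# Hypothetical patch with two states and 2-dimensional embeddings.
print(expected_embedding([0.25, 0.75], [[1.0, 0.0], [0.0, 2.0]]))  # [0.25, 1.5]
```

Using the soft state probabilities instead of a hard argmax keeps the embedding differentiable with respect to the state posterior, which is one plausible reason for the design choice.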

Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities

Abstract

arXiv:2505.13195v1 Announce Type: new Abstract: As Large Language Models (LLMs) become increasingly integrated into real-world decision-making systems, understanding their behavioural vulnerabilities remains a critical challenge for AI safety and alignment. While existing evaluation metrics focus primarily on reasoning accuracy or factual correctness, they often overlook whether LLMs are robust to adversarial manipulation or capable of using adaptive strategies in dynamic environments. This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of LLMs under interactive and adversarial conditions. Drawing on methodologies from cognitive psychology and game theory, our framework probes how models respond in two canonical tasks: the two-armed bandit task and the Multi-Round Trust Task. These tasks capture key aspects of exploration-exploitation trade-offs, social cooperation, and strategic flexibility. We apply this framework to several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini-1.5, and DeepSeek-V3, revealing model-specific susceptibilities to manipulation and rigidity in strategy adaptation. Our findings highlight distinct behavioral patterns across models and emphasize the importance of adaptability and fairness recognition for trustworthy AI deployment. Rather than offering a performance benchmark, this work proposes a methodology for diagnosing decision-making weaknesses in LLM-based agents, providing actionable insights for alignment and safety research.

摘要

随着大语言模型（LLMs）日益融入现实世界决策系统，理解其行为脆弱性仍是人工智能安全与对齐研究的核心挑战。现有评估指标主要关注推理准确性或事实正确性，却往往忽视LLMs对对抗性操纵的鲁棒性及在动态环境中运用适应性策略的能力。本文提出一种对抗性评估框架，旨在交互与对抗条件下系统化压力测试LLMs的决策过程。借鉴认知心理学与博弈论方法，该框架通过双臂老虎机任务和多轮信任任务两类经典实验，探究模型在探索-开发权衡、社会合作及策略灵活性等关键维度上的表现。我们将此框架应用于GPT-3.5、GPT-4、Gemini-1.5和DeepSeek-V3等前沿模型，揭示了模型在策略适应性方面特有的易操纵性与僵化特征。研究发现不仅凸显了各模型间的行为模式差异，更强调了适应性能力与公平性认知对可信AI部署的重要性。本研究并非提供性能基准，而是提出一种诊断基于LLM智能体决策缺陷的方法论，为对齐与安全研究提供可操作的洞见。

ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

Abstract

arXiv:2505.13180v1 Announce Type: new Abstract: Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans compared to planning in natural language, with recent works extending this idea to visual domains using Vision-Language Models (VLMs). However, rigorous comparison between VLM-grounded symbolic approaches and methods that plan directly with a VLM has been hindered by a lack of common environments, evaluation protocols and model coverage. We introduce ViPlan, the first open-source benchmark for Visual Planning with symbolic predicates and VLMs. ViPlan features a series of increasingly challenging tasks in two domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We benchmark nine open-source VLM families across multiple sizes, along with selected closed models, evaluating both VLM-grounded symbolic planning and using the models directly to propose actions. We find symbolic planning to outperform direct VLM planning in Blocksworld, where accurate image grounding is crucial, whereas the opposite is true in the household robotics tasks, where commonsense knowledge and the ability to recover from errors are beneficial. Finally, we show that across most models and methods, there is no significant benefit to using Chain-of-Thought prompting, suggesting that current VLMs still struggle with visual reasoning.

摘要

与自然语言规划相比，将大语言模型与符号规划器相结合是获得可验证且有据可依（grounded）的规划的有前景方向，近期研究通过视觉语言模型（VLM）将该思路扩展至视觉领域。然而，由于缺乏统一的环境、评估协议和模型覆盖范围，视觉语言模型接地的符号规划方法与直接使用视觉语言模型规划的方法之间一直难以进行严格比较。我们推出ViPlan，这是首个支持符号谓词和视觉语言模型的开源视觉规划基准。ViPlan包含两个领域中难度递增的任务系列：经典积木世界规划问题的视觉变体，以及模拟家庭机器人环境。我们对九种不同规模的开源视觉语言模型家族及部分闭源模型进行基准测试，评估基于视觉语言模型接地的符号规划和直接使用模型生成动作两种方法。研究发现：在需要精确图像接地的积木世界任务中，符号规划优于直接视觉语言模型规划；而在需要常识知识和错误恢复能力的家庭机器人任务中，结果相反。最后实验表明，对于大多数模型和方法，使用思维链提示并未带来显著收益，这表明当前视觉语言模型在视觉推理方面仍存在局限。

Multi-Armed Bandits Meet Large Language Models

Abstract

arXiv:2505.13355v1 Announce Type: new Abstract: Bandit algorithms and Large Language Models (LLMs) have emerged as powerful tools in artificial intelligence, each addressing distinct yet complementary challenges in decision-making and natural language processing. This survey explores the synergistic potential between these two fields, highlighting how bandit algorithms can enhance the performance of LLMs and how LLMs, in turn, can provide novel insights for improving bandit-based decision-making. We first examine the role of bandit algorithms in optimizing LLM fine-tuning, prompt engineering, and adaptive response generation, focusing on their ability to balance exploration and exploitation in large-scale learning tasks. Subsequently, we explore how LLMs can augment bandit algorithms through advanced contextual understanding, dynamic adaptation, and improved policy selection using natural language reasoning. By providing a comprehensive review of existing research and identifying key challenges and opportunities, this survey aims to bridge the gap between bandit algorithms and LLMs, paving the way for innovative applications and interdisciplinary research in AI.

摘要

老虎机（bandit）算法与大型语言模型（LLMs）已成为人工智能领域的强大工具，分别在决策制定和自然语言处理中解决独特而互补的挑战。本综述探讨了这两个领域之间的协同潜力，重点分析了老虎机算法如何提升LLMs的性能，以及LLMs如何为改进基于老虎机算法的决策提供新思路。我们首先研究老虎机算法在优化LLM微调、提示工程和自适应响应生成中的作用，重点关注其在大规模学习任务中平衡探索与利用的能力。随后，我们探讨LLMs如何通过高级上下文理解、动态适应和基于自然语言推理的策略选择来增强老虎机算法。通过对现有研究的全面回顾及关键挑战与机遇的梳理，本综述旨在弥合老虎机算法与LLMs之间的鸿沟，为人工智能领域的创新应用与跨学科研究铺平道路。
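
The exploration-exploitation balance that the survey highlights can be illustrated with UCB1, a standard bandit algorithm. The following is a minimal sketch and is not taken from the survey: the arms stand in for hypothetical prompt templates, and the success probabilities passed to `run_bandit` are invented purely for illustration.

```python
import math
import random


def ucb1_select(counts, totals, t):
    """Return the index of the arm with the highest UCB1 score at round t."""
    # Play every arm once before trusting any empirical mean.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        totals[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(success_probs, rounds, seed=0):
    """Simulate UCB1 on Bernoulli arms; return how often each arm was played.

    success_probs are hypothetical per-template success rates (an assumption
    made for this sketch, not data from any paper).
    """
    rng = random.Random(seed)
    counts = [0] * len(success_probs)
    totals = [0.0] * len(success_probs)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, t)
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts
```

The bonus term shrinks as an arm is played more often, so rarely tried arms keep being revisited (exploration) while the arm with the best empirical mean receives most of the plays (exploitation).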

CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition

Abstract

arXiv:2505.13380v1 Announce Type: new Abstract: Sparse mixture of experts (SMoE) offers an appealing solution to scale up the model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of the suboptimal routing process where experts that perform computation do not directly contribute to the routing process. In this work, we propose competition, a novel mechanism to route tokens to experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys a better sample efficiency than the traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy, thus enjoying strong performance at a low training overhead. Our extensive empirical evaluations on both the visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the previous study at arXiv:2402.02526.

摘要

稀疏专家混合模型（SMoE）为超越单纯增加网络深度或宽度的传统方法提供了一种扩展模型复杂度的有效方案。然而，我们认为当前SMoE训练仍面临挑战，这源于其路由过程存在缺陷——执行计算的专家并未直接参与路由决策。本研究提出"竞争机制"这一创新方法，通过将令牌分配给具有最高神经响应的专家来实现路由。理论分析表明，该机制相比传统softmax路由具有更优的样本效率。基于此，我们开发了CompeteSMoE算法：通过部署学习竞争策略的路由器，该算法能以较低训练开销实现强大性能。在视觉指令微调与语言预训练任务上的大量实验表明，相较于最先进的SMoE策略，CompeteSMoE展现出卓越的效能、鲁棒性和可扩展性。实现代码已开源：https://github.com/Fsoft-AIC/CompeteSMoE。本工作是对arXiv:2402.02526先前研究的改进版本。
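
The competition idea, routing each token to the experts with the highest neural response, can be sketched in a few lines. This is an illustrative toy and not the CompeteSMoE implementation: it assumes that the "neural response" is the L2 norm of each expert's output (the abstract does not define it), models each expert as a small linear map, and omits the learned router that the paper trains to predict the competition outcome.

```python
import math


def expert_response(weights, token):
    """Apply one toy linear expert: a matrix-vector product."""
    return [sum(w * x for w, x in zip(row, token)) for row in weights]


def competition_route(experts, token, k=1):
    """Run every expert, keep the k with the largest output norm, and mix them.

    Assumption for this sketch: response strength = L2 norm of the output.
    Returns (winner indices, mixed output vector).
    """
    outputs = [expert_response(w, token) for w in experts]
    norms = [math.sqrt(sum(v * v for v in out)) for out in outputs]
    winners = sorted(range(len(experts)), key=lambda i: norms[i], reverse=True)[:k]

    # Softmax over the winners' norms gives the mixing weights.
    top = max(norms[i] for i in winners)
    exps = {i: math.exp(norms[i] - top) for i in winners}
    z = sum(exps.values())
    dim = len(outputs[0])
    mixed = [sum(exps[i] / z * outputs[i][d] for i in winners) for d in range(dim)]
    return winners, mixed
```

Evaluating every expert for every token is expensive, which is presumably why the paper deploys a router to learn the competition policy instead of running the full competition at each step.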

AutoMathKG: The automated mathematical knowledge graph based on LLM and vector database

Abstract

arXiv:2505.13406v1 Announce Type: new Abstract: A mathematical knowledge graph (KG) presents knowledge within the field of mathematics in a structured manner. Constructing a math KG using natural language is an essential but challenging task. There are two major limitations of existing works: first, they are constrained by corpus completeness, often discarding or manually supplementing incomplete knowledge; second, they typically fail to fully automate the integration of diverse knowledge sources. This paper proposes AutoMathKG, a high-quality, wide-coverage, and multi-dimensional math KG capable of automatic updates. AutoMathKG regards mathematics as a vast directed graph composed of Definition, Theorem, and Problem entities, with their reference relationships as edges. It integrates knowledge from ProofWiki, textbooks, arXiv papers, and TheoremQA, enhancing entities and relationships with large language models (LLMs) via in-context learning for data augmentation. To search for similar entities, MathVD, a vector database, is built through two designed embedding strategies using SBERT. To automatically update, two mechanisms are proposed. For knowledge completion mechanism, Math LLM is developed to interact with AutoMathKG, providing missing proofs or solutions. For knowledge fusion mechanism, MathVD is used to retrieve similar entities, and LLM is used to determine whether to merge with a candidate or add as a new entity. A wide range of experiments demonstrate the advanced performance and broad applicability of the AutoMathKG system, including superior reachability query results in MathVD compared to five baselines and robust mathematical reasoning capability in Math LLM.

摘要

数学知识图谱（KG）以结构化方式呈现数学领域的知识。利用自然语言构建数学KG是一项重要但具有挑战性的任务。现有研究存在两大局限：首先受限于语料库完整性，常需丢弃或人工补充不完整知识；其次通常无法实现多源知识的全自动整合。本文提出AutoMathKG，一个支持自动更新的高质量、广覆盖、多维度的数学KG。该系统将数学视为由定义、定理和问题实体构成的巨型有向图，其引用关系作为边。通过整合ProofWiki、教科书、arXiv论文和TheoremQA的知识，并采用上下文学习的大语言模型（LLM）进行数据增强以完善实体与关系。为检索相似实体，基于SBERT设计两种嵌入策略构建向量数据库MathVD。为实现自动更新，提出两种机制：知识补全机制通过开发的Math LLM与AutoMathKG交互，提供缺失证明或解答；知识融合机制利用MathVD检索相似实体，由LLM判定与候选实体合并或新增实体。大量实验表明AutoMathKG系统具有先进性能与广泛适用性，包括MathVD在可达性查询中优于五种基线方法，以及Math LLM展现的强健数学推理能力。

MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

Abstract

arXiv:2505.13427v1 Announce Type: new Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our codes and data at https://github.com/ModalMinds/MM-PRM.

摘要

虽然多模态大语言模型（MLLMs）在视觉语言理解方面取得了显著进展，但其在复杂多步推理任务中仍存在困难，常产生逻辑不一致或部分正确的解决方案。关键限制在于缺乏对中间推理步骤的细粒度监督。为此，我们提出MM-PRM——一个在全自动化、可扩展框架下训练的过程奖励模型。首先构建MM-Policy（基于多样化数学推理数据训练的强大多模态模型），随后创建包含10,000道可验证答案的多模态数学题精选数据集MM-K12作为种子数据。通过基于蒙特卡洛树搜索（MCTS）的流程，我们在无需人工标注的情况下生成超过70万条步骤级注释。所得PRM模型用于在Best-of-N推理设置中对候选推理路径进行评分，在领域内（MM-K12测试集）和跨领域（OlympiadBench、MathVista等）基准测试中均实现显著提升。进一步分析证实了软标签、较小学习率和路径多样性对优化PRM性能的有效性。MM-PRM证明过程监督是增强多模态推理系统逻辑鲁棒性的有力工具。所有代码和数据已发布于https://github.com/ModalMinds/MM-PRM。
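
Best-of-N selection with a process reward model can be sketched as follows. This is a hedged illustration rather than the MM-PRM code: `score_steps` is a hypothetical callable standing in for the trained PRM, and the aggregation rules (`min`, `mean`, `product`) are common choices that the abstract does not specify.

```python
import math


def aggregate_step_scores(step_scores, how="min"):
    """Collapse per-step PRM scores into one path-level score."""
    if not step_scores:
        raise ValueError("a reasoning path needs at least one step")
    if how == "min":
        return min(step_scores)
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    if how == "product":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates, score_steps, how="min"):
    """Pick the candidate reasoning path with the highest aggregated score.

    candidates: list of paths, each a list of step strings.
    score_steps: hypothetical PRM stand-in mapping a path to per-step scores.
    """
    scored = [
        (aggregate_step_scores(score_steps(path), how), index)
        for index, path in enumerate(candidates)
    ]
    best_score, best_index = max(scored, key=lambda pair: pair[0])
    return candidates[best_index], best_score
```

With `min` aggregation a single weak step sinks the whole path, which matches the motivation of supervising intermediate steps rather than only the final answer.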

Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Abstract

arXiv:2505.13445v1 Announce Type: new Abstract: Large Language Models (LLMs) show great promise in complex reasoning, with Reinforcement Learning with Verifiable Rewards (RLVR) being a key enhancement strategy. However, a prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this. RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and self-verification abilities within a single, integrated RL process. The core mechanism involves leveraging verifiable rewards from an outcome verifier to provide on-the-fly feedback for both solution generation and self-verification tasks. In each iteration, the model generates solutions, then critiques its own on-policy generated solutions, with both trajectories contributing to the policy update. Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves the model's problem-solving accuracy while concurrently fostering strong self-verification skills. Our analyses highlight the advantages of online verification and the benefits of increased verification compute. Additionally, RISE models exhibit more frequent and accurate self-verification behaviors during reasoning. These advantages reinforce RISE as a flexible and effective path towards developing more robust and self-aware reasoners.

摘要

大语言模型（LLMs）在复杂推理任务中展现出巨大潜力，其中基于可验证奖励的强化学习（RLVR）是关键增强策略。然而，当前普遍存在"表面化自我反思"问题，即模型无法稳健验证自身输出。为此，我们提出RISE（通过自我验证强化推理），一种创新的在线强化学习框架。RISE通过单一集成式强化学习过程，显式且同步地训练大语言模型提升其问题解决与自我验证能力。其核心机制在于利用结果验证器提供的可验证奖励，为解决方案生成和自我验证任务提供实时反馈。在每次迭代中，模型首先生成解决方案，随后对同策略生成的解决方案进行批判性评估，两条轨迹共同参与策略更新。在多样化数学推理基准测试上的大量实验表明，RISE能持续提升模型的问题解决准确率，同时培养强大的自我验证能力。我们的分析凸显了在线验证的优势以及增加验证计算资源的益处。此外，RISE模型在推理过程中表现出更频繁且准确的自我验证行为。这些优势使RISE成为开发更具鲁棒性和自我意识推理器的灵活有效路径。
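
The way one outcome verifier can supply rewards for both tasks can be sketched as below. This is an assumption-laden illustration, not the RISE implementation: exact string matching as the verifier and a binary self-assessed verdict are simplifications chosen for this example, and the function names are hypothetical.

```python
def outcome_verifier(answer, reference):
    """Toy verifiable reward: 1.0 if the final answer matches the reference."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def rise_style_rewards(solution_answer, self_verdict, reference):
    """Return (solve_reward, verify_reward) for one problem.

    solution_answer: the final answer extracted from the model's solution.
    self_verdict: the model's own judgment of that solution (1.0 or 0.0).
    The verification reward is 1.0 when the self-verdict agrees with the
    outcome verifier, so the same verifier signal trains both behaviours.
    """
    solve_reward = outcome_verifier(solution_answer, reference)
    verify_reward = 1.0 if self_verdict == solve_reward else 0.0
    return solve_reward, verify_reward
```

Under this scheme a model that confidently approves its own wrong answer earns no reward on either task, which is the failure mode the abstract calls superficial self-reflection.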

CoT-Kinetics: A Theoretical Modeling Assessing LRM Reasoning Process

Abstract

arXiv:2505.13408v1 Announce Type: new Abstract: Recent Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models by learning to reason, exhibiting promising performance in solving complex tasks. LRMs solve tasks that require complex reasoning by explicitly generating reasoning trajectories together with answers. Nevertheless, judging the quality of such an output answer is not easy because only considering the correctness of the answer is not enough and the soundness of the reasoning trajectory matters as well. Logically, if the soundness of the reasoning part is poor, even if the answer is correct, the confidence of the derived answer should be low. Existing methods do jointly assess the overall output by taking the reasoning part into account; however, their capability is still unsatisfactory because the causal relationship between the reasoning and the concluded answer is not properly reflected. In this paper, inspired by classical mechanics, we present a novel approach towards establishing a CoT-Kinetics energy equation. Specifically, our CoT-Kinetics energy equation formulates the token state transformation process, which is regulated by the LRM's internal transformer layers, as analogous to the dynamics of a particle governed by a mechanical field. Our CoT-Kinetics energy assigns a scalar score to evaluate specifically the soundness of the reasoning phase, telling how confident the derived answer could be given the evaluated reasoning. As such, the LRM's overall output quality can be accurately measured, rather than reduced to a coarse judgment (e.g., correct or incorrect).

摘要

近期的大型推理模型通过学习推理能力显著提升了大型语言模型的推理性能，在解决复杂任务中展现出优异表现。这类模型通过显式生成推理轨迹与答案来应对需要复杂推理的任务。然而，评估此类输出答案的质量并非易事，因为仅考虑答案正确性并不足够，推理轨迹部分的合理性同样至关重要。从逻辑上讲，若推理部分的合理性不足，即使答案正确，所得答案的可信度也应较低。现有方法虽已尝试通过结合推理部分来联合评估整体输出答案，但其能力仍不尽如人意，因为推理与结论答案之间的因果关系未能得到恰当反映。本文受经典力学启发，提出了一种建立"思维链-动力学"能量方程的新方法。具体而言，我们的"思维链-动力学"能量方程将受模型内部Transformer层调控的token状态转换过程，类比为受力学场支配的粒子动力学过程。该能量方程通过标量评分专门评估推理阶段的合理性，从而量化给定推理过程下所得答案的可信度。由此，模型整体输出质量可被精确度量，而不再局限于粗糙判断（如正确或错误）。

AI-generated Text Detection: A Multifaceted Approach to Binary and Multiclass Classification

Abstract

arXiv:2505.11550v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing across a wide range of styles and genres. However, such capabilities are prone to potential misuse, such as fake news generation, spam email creation, and misuse in academic assignments. As a result, accurate detection of AI-generated text and identification of the model that generated it are crucial for maintaining the responsible use of LLMs. In this work, we addressed two sub-tasks put forward by the Defactify workshop under the AI-Generated Text Detection shared task at the Association for the Advancement of Artificial Intelligence (AAAI 2025): Task A involved distinguishing between human-authored and AI-generated text, while Task B focused on attributing text to its originating language model. For each task, we proposed two neural architectures: an optimized model and a simpler variant. For Task A, the optimized neural architecture achieved fifth place with an F1 score of 0.994, and for Task B, the simpler neural architecture also ranked fifth with an F1 score of 0.627.

摘要

大型语言模型（LLMs）在生成各类风格和体裁的、高度接近人类书写的文本方面展现出卓越能力。然而，这种能力可能被滥用，例如制造虚假新闻、创建垃圾邮件以及在学术作业中的不当使用。因此，准确检测AI生成文本并识别其来源模型对于确保LLMs的负责任使用至关重要。本研究针对人工智能促进协会（AAAI 2025）AI生成文本检测共享任务中Defactify研讨会提出的两个子任务：任务A需区分人类撰写或AI生成的文本，任务B则侧重于追溯文本的原始语言模型。针对每个任务，我们提出了两种神经架构：优化模型和简化变体。在任务A中，优化神经架构以0.994的F1分数位列第五；在任务B中，简化神经架构同样以0.627的F1分数排名第五。

On Technique Identification and Threat-Actor Attribution using LLMs and Embedding Models

Abstract

arXiv:2505.11547v1 Announce Type: cross Abstract: Attribution of cyber-attacks remains a complex but critical challenge for cyber defenders. Currently, manual extraction of behavioral indicators from dense forensic documentation causes significant attribution delays, especially following major incidents at the international scale. This research evaluates large language models (LLMs) for cyber-attack attribution based on behavioral indicators extracted from forensic documentation. We test OpenAI's GPT-4 and text-embedding-3-large for identifying threat actors' tactics, techniques, and procedures (TTPs) by comparing LLM-generated TTPs against human-generated data from MITRE ATT&CK Groups. Our framework then identifies TTPs from text using vector embedding search and builds profiles to attribute new attacks for a machine learning model to learn. Key contributions include: (1) assessing off-the-shelf LLMs for TTP extraction and attribution, and (2) developing an end-to-end pipeline from raw CTI documents to threat-actor prediction. This research finds that standard LLMs generate TTP datasets with noise, resulting in a low similarity to human-generated datasets. However, the TTPs generated are similar in frequency to those within the existing MITRE datasets. Additionally, although these TTPs are different than human-generated datasets, our work demonstrates that they still prove useful for training a model that performs above baseline on attribution. Project code and files are contained here: https://github.com/kylag/ttp_attribution.

摘要

网络攻击归因始终是网络防御者面临的一项复杂而关键的挑战。当前从密集的取证文档中人工提取行为指标会导致显著的归因延迟，尤其在国际重大安全事件发生后更为突出。本研究评估了基于取证文档提取行为指标的大型语言模型（LLMs）在网络攻击归因中的应用。我们测试了OpenAI的GPT-4和text-embedding-3-large模型，通过将LLM生成的战术、技术与程序（TTPs）与MITRE ATT&CK Groups人工标注数据进行比对，识别威胁行为体的TTPs。研究构建的框架首先通过向量嵌入搜索从文本中识别TTPs，继而建立特征档案用于新攻击的归因，最终供机器学习模型学习。主要贡献包括：（1）评估现成LLM在TTP提取与归因中的表现；（2）开发从原始网络威胁情报文档到威胁行为体预测的端到端流程。研究发现标准LLM生成的TTP数据集存在噪声，与人工生成数据集相似度较低，但其生成的TTP频率分布与现有MITRE数据集具有相似性。此外，尽管这些TTP与人工数据集存在差异，但研究表明其仍能有效训练出超越基线水平的归因模型。项目代码及文件详见：https://github.com/kylag/ttp_attribution。

Abstract

arXiv:2505.11557v1 Announce Type: cross Abstract: Corporate LLMs are gaining traction for efficient knowledge dissemination and management within organizations. However, as current LLMs are vulnerable to leaking sensitive information, it has proven difficult to apply them in settings where strict access control is necessary. To this end, we design AC-LoRA, an end-to-end system for access control-aware corporate LLM chatbots that maintains a strong information isolation guarantee. AC-LoRA maintains separate LoRA adapters for permissioned datasets, along with the document embedding they are finetuned on. AC-LoRA retrieves a precise set of LoRA adapters based on the similarity score with the user query and their permission. This similarity score is later used to merge the responses if more than one LoRA is retrieved, without requiring any additional training for LoRA routing. We provide an end-to-end prototype of AC-LoRA, evaluate it on two datasets, and show that AC-LoRA matches or even exceeds the performance of state-of-the-art LoRA mixing techniques while providing strong isolation guarantees. Furthermore, we show that AC-LoRA design can be directly applied to different modalities.

摘要

企业级大语言模型（LLM）正日益成为组织内部高效知识传播与管理的重要工具。然而，由于现有LLM存在敏感信息泄露风险，其在需要严格访问控制的环境中应用仍面临挑战。为此，我们设计了AC-LoRA系统——一种具备访问控制意识的企业级LLM聊天机器人端到端解决方案，可确保强信息隔离。该系统为不同权限数据集维护独立的LoRA适配器及对应的微调文档嵌入，通过用户查询与权限的相似度评分精准检索LoRA适配器集合。当检索到多个适配器时，系统直接利用该评分进行响应融合，无需额外训练LoRA路由模块。我们实现了AC-LoRA的端到端原型，在两个数据集上的实验表明：在提供强隔离保障的同时，其性能达到甚至超越了最先进的LoRA混合技术。此外，研究证实AC-LoRA的设计可直接扩展至多模态场景。
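
The retrieval step described above, selecting LoRA adapters by permission and query similarity and then reusing the similarity scores as merge weights, can be sketched as follows. This is a simplified illustration and not the AC-LoRA system: the `Adapter` record, the permission labels, the cosine-similarity threshold, and the normalisation of scores into weights are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Adapter:
    name: str                 # hypothetical adapter identifier
    required_permission: str  # permission needed to use this adapter
    embedding: list           # embedding of the documents it was tuned on


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_adapters(query_embedding, user_permissions, adapters, threshold=0.5):
    """Return (adapter, weight) pairs the user may access; weights sum to 1.

    Permission filtering happens before similarity scoring, so an adapter the
    user is not entitled to can never contribute to the response.
    """
    allowed = [a for a in adapters if a.required_permission in user_permissions]
    scored = [(a, cosine(query_embedding, a.embedding)) for a in allowed]
    selected = [(a, s) for a, s in scored if s >= threshold]
    total = sum(s for _, s in selected)
    return [(a, s / total) for a, s in selected] if total > 0 else []
```

Filtering by permission first is the design choice that matters for isolation: a highly similar but unauthorised adapter is dropped before it can influence the merged answer.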

Assessing Collective Reasoning in Multi-Agent LLMs via Hidden Profile Tasks

Abstract

arXiv:2505.11556v1 Announce Type: cross Abstract: Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving through distributed information integration, but also risk replicating collective reasoning failures observed in human groups. Yet, no theory-grounded benchmark exists to systematically evaluate such failures. In this paper, we introduce the Hidden Profile paradigm from social psychology as a diagnostic testbed for multi-agent LLM systems. By distributing critical information asymmetrically across agents, the paradigm reveals how inter-agent dynamics support or hinder collective reasoning. We first formalize the paradigm for multi-agent decision-making under distributed knowledge and instantiate it as a benchmark with nine tasks spanning diverse scenarios, including adaptations from prior human studies. We then conduct experiments with GPT-4.1 and five other leading LLMs, including reasoning-enhanced variants, showing that multi-agent systems across all models fail to match the accuracy of single agents given complete information. While agents' collective performance is broadly comparable to that of human groups, nuanced behavioral differences emerge, such as increased sensitivity to social desirability. Finally, we demonstrate the paradigm's diagnostic utility by exploring a cooperation-contradiction trade-off in multi-agent LLM systems. We find that while cooperative agents are prone to over-coordination in collective settings, increased contradiction impairs group convergence. This work contributes a reproducible framework for evaluating multi-agent LLM systems and motivates future research on artificial collective intelligence and human-AI interaction.

摘要

基于大语言模型（LLM）构建的多智能体系统有望通过分布式信息整合提升问题解决能力，但也可能复现人类群体中观察到的集体推理失败现象。然而目前缺乏理论基础的基准来系统评估此类缺陷。本文引入社会心理学中的"隐藏档案"范式作为多智能体LLM系统的诊断测试平台，通过非对称分布关键信息来揭示智能体间动态如何支持或阻碍集体推理。我们首先将该范式形式化为分布式知识下的多智能体决策框架，并实例化为包含九项任务的基准测试集，涵盖多种场景（包括对人类研究的改编）。随后使用GPT-4.1等六种主流LLM（含推理增强变体）进行实验，结果表明所有模型的多智能体系统均无法达到掌握完整信息的单智能体准确率。虽然智能体集体表现与人类群体大体相当，但存在细微行为差异（如对社会期望的敏感性增强）。最后通过探索多智能体LLM系统中的合作-矛盾权衡验证范式诊断价值：合作型智能体在集体环境中易出现过度协调，而矛盾增加会阻碍群体收敛。本研究贡献了可复现的多智能体LLM系统评估框架，为人工集体智能和人机交互的未来研究提供方向。

One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems

Abstract

arXiv:2505.11548v1 Announce Type: cross Abstract: Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have shown improved performance in generating accurate responses. However, the dependence on external knowledge bases introduces potential security vulnerabilities, particularly when these knowledge bases are publicly accessible and modifiable. Poisoning attacks on knowledge bases for RAG systems face two fundamental challenges: the injected malicious content must compete with multiple authentic documents retrieved by the retriever, and LLMs tend to trust retrieved information that aligns with their internal memorized knowledge. Previous works attempt to address these challenges by injecting multiple malicious documents, but such saturation attacks are easily detectable and impractical in real-world scenarios. To enable the effective single document poisoning attack, we propose AuthChain, a novel knowledge poisoning attack method that leverages Chain-of-Evidence theory and authority effect to craft more convincing poisoned documents. AuthChain generates poisoned content that establishes strong evidence chains and incorporates authoritative statements, effectively overcoming the interference from both authentic documents and LLMs' internal knowledge. Extensive experiments across six popular LLMs demonstrate that AuthChain achieves significantly higher attack success rates while maintaining superior stealthiness against RAG defense mechanisms compared to state-of-the-art baselines.

摘要

增强检索生成（RAG）能力的大型语言模型（LLMs）在生成准确响应方面表现出性能提升。然而，这种对外部知识库的依赖性引入了潜在安全漏洞，特别是当知识库可公开访问和修改时。针对RAG系统知识库的投毒攻击面临两个根本性挑战：注入的恶意内容需与检索器获取的多个真实文档竞争，且LLMs倾向于信任与其内部记忆知识相符的检索信息。现有研究尝试通过注入多份恶意文档来解决这些问题，但此类饱和攻击在现实场景中易被检测且不实用。为实现有效的单文档投毒攻击，我们提出AuthChain——一种基于证据链理论和权威效应的新型知识投毒攻击方法。该方法通过构建强证据链并结合权威性陈述来生成更具说服力的污染文档，有效克服了真实文档和LLMs内部知识的干扰。在六种主流LLMs上的大量实验表明，与现有基线相比，AuthChain在保持对RAG防御机制高度隐蔽性的同时，实现了显著更高的攻击成功率。

InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models

Abstract

arXiv:2505.11574v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive performance on complex reasoning benchmarks such as GSM8K, MATH, and AIME. However, the substantial computational demands of these tasks pose significant challenges for real-world deployment. Model quantization has emerged as a promising approach to reduce memory footprint and inference latency by representing weights and activations with lower bit-widths. In this work, we conduct a comprehensive study of mainstream quantization methods (e.g., AWQ, GPTQ, SmoothQuant) on the most popular open-sourced models (e.g., Qwen2.5, LLaMA3 series), and reveal that quantization can degrade mathematical reasoning accuracy by up to 69.81%. To better understand this degradation, we develop an automated assignment and judgment pipeline that qualitatively categorizes failures into four error types and quantitatively identifies the most impacted reasoning capabilities. Building on these findings, we employ an automated data-curation pipeline to construct a compact "Silver Bullet" dataset. Training a quantized model on as few as 332 carefully selected examples for just 3-5 minutes on a single GPU is enough to restore its reasoning accuracy to match that of the full-precision baseline.

摘要

大型语言模型（LLMs）在GSM8K、MATH和AIME等复杂推理基准测试中展现出卓越性能，但其高昂的计算成本给实际部署带来重大挑战。模型量化通过以更低比特宽度表示权重和激活值，成为减少内存占用和推理延迟的有效途径。本研究对主流量化方法（如AWQ、GPTQ、SmoothQuant）在热门开源模型（如Qwen2.5、LLaMA3系列）上的表现进行全面分析，发现量化会导致数学推理准确率最高下降69.81%。为深入理解性能衰减机制，我们开发了自动化任务分配与评估流程，定性归纳出四类错误类型，并定量识别受影响最严重的推理能力。基于这些发现，采用自动化数据筛选流程构建精炼的"银弹"数据集，仅需在单GPU上使用332个精选样本进行3-5分钟训练，即可使量化模型的推理准确率恢复至全精度基线水平。
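
To make "representing weights with lower bit-widths" concrete, the sketch below applies plain round-to-nearest symmetric quantization to a handful of made-up weights and measures the round-trip error. It is deliberately simpler than AWQ, GPTQ, or SmoothQuant, which add calibration and rescaling on top of this basic step, and the example weights are invented for illustration.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


def mean_abs_error(values, bits):
    """Average absolute round-trip error at the given bit-width."""
    quantized, scale = quantize_symmetric(values, bits)
    restored = dequantize(quantized, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)
```

Halving the bit-width from 8 to 4 shrinks the number of representable levels from 255 to 15, so the rounding error per weight grows sharply; errors of this kind are one plausible source of the reasoning degradation the abstract reports.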

ACSE-Eval: Can LLMs threat model real-world cloud infrastructure?

Abstract

arXiv:2505.11565v1 Announce Type: cross Abstract: While Large Language Models have shown promise in cybersecurity applications, their effectiveness in identifying security threats within cloud deployments remains unexplored. This paper introduces AWS Cloud Security Engineering Eval (ACSE-Eval), a novel dataset for evaluating LLMs' cloud security threat modeling capabilities. ACSE-Eval contains 100 production-grade AWS deployment scenarios, each featuring detailed architectural specifications, Infrastructure as Code implementations, documented security vulnerabilities, and associated threat modeling parameters. Our dataset enables systematic assessment of LLMs' abilities to identify security risks, analyze attack vectors, and propose mitigation strategies in cloud environments. Our evaluations on ACSE-Eval demonstrate that GPT 4.1 and Gemini 2.5 Pro excel at threat identification, with Gemini 2.5 Pro performing optimally in 0-shot scenarios and GPT 4.1 showing superior results in few-shot settings. While GPT 4.1 maintains a slight overall performance advantage, Claude 3.7 Sonnet generates the most semantically sophisticated threat models but struggles with threat categorization and generalization. To promote reproducibility and advance research in automated cybersecurity threat analysis, we open-source our dataset, evaluation metrics, and methodologies.

摘要

虽然大型语言模型在网络安全应用中展现出潜力，但其在云部署环境中识别安全威胁的有效性尚未得到验证。本文提出AWS云安全工程评估数据集（ACSE-Eval），这是用于评估LLM云安全威胁建模能力的新型基准。该数据集包含100个生产级AWS部署场景，每个场景均具备详细的架构规范、基础设施即代码实现、已记录的安全漏洞及相关威胁建模参数。我们的数据集支持系统评估LLM在云环境中识别安全风险、分析攻击向量及提出缓解策略的能力。基于ACSE-Eval的评估表明，GPT 4.1和Gemini 2.5 Pro在威胁识别方面表现优异，其中Gemini 2.5 Pro在零样本场景中表现最佳，而GPT 4.1在小样本设置中展现优势。虽然GPT 4.1保持轻微的整体性能优势，但Claude 3.7 Sonnet能生成语义最复杂的威胁模型，却在威胁分类与泛化方面存在不足。为促进自动化网络安全威胁分析研究的可重复性与进展，我们开源了数据集、评估指标及方法论。

Tool-Aided Evolutionary LLM for Generative Policy Toward Efficient Resource Management in Wireless Federated Learning

Abstract

arXiv:2505.11570v1 Announce Type: cross Abstract: Federated Learning (FL) enables distributed model training across edge devices in a privacy-friendly manner. However, its efficiency heavily depends on effective device selection and high-dimensional resource allocation in dynamic and heterogeneous wireless environments. Conventional methods demand a confluence of domain-specific expertise, extensive hyperparameter tuning, and/or heavy interaction cost. This paper proposes a Tool-aided Evolutionary Large Language Model (T-ELLM) framework to generate a qualified policy for device selection in a wireless FL environment. Unlike conventional optimization methods, T-ELLM leverages natural language-based scenario prompts to enhance generalization across varying network conditions. The framework decouples the joint optimization problem mathematically, enabling tractable learning of device selection policies while delegating resource allocation to convex optimization tools. To improve adaptability, T-ELLM integrates a sample-efficient, model-based virtual learning environment that captures the relationship between device selection and learning performance, facilitating subsequent group relative policy optimization. This concerted approach reduces reliance on real-world interactions, minimizing communication overhead while maintaining high-fidelity decision-making. Theoretical analysis proves that the discrepancy between virtual and real environments is bounded, ensuring the advantage function learned in the virtual environment maintains a provably small deviation from real-world conditions. Experimental results demonstrate that T-ELLM outperforms benchmark methods in energy efficiency and exhibits robust adaptability to environmental changes.

摘要

联邦学习（FL）通过隐私友好的方式实现跨边缘设备的分布式模型训练。然而，其效率高度依赖于动态异构无线环境中有效的设备选择和高维资源分配。传统方法需要融合领域专业知识、大量超参数调优和/或高昂的交互成本。本文提出工具辅助进化大语言模型（T-ELLM）框架，用于生成无线FL环境中合格的设备选择策略。与传统优化方法不同，T-ELLM利用基于自然语言的场景提示来增强不同网络条件下的泛化能力。该框架通过数学解耦联合优化问题，实现可处理的设备选择策略学习，同时将资源分配委托给凸优化工具。为提高适应性，T-ELLM集成了一种样本高效、基于模型的虚拟学习环境，该环境捕获设备选择与学习性能之间的关系，促进后续群体相对策略优化。这种协同方法降低了对现实交互的依赖，在保持高保真决策的同时最小化通信开销。理论分析证明虚拟环境与现实环境之间的差异是有界的，确保虚拟环境中学习的优势函数与现实条件保持可证明的微小偏差。实验结果表明，T-ELLM在能效方面优于基准方法，并对环境变化表现出强大的适应能力。

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Abstract

arXiv:2505.11594v1 Announce Type: cross Abstract: The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.

摘要

注意力机制因其二次时间复杂度而面临效率挑战。本研究通过两项关键创新提升注意力效率：首先，我们利用Blackwell GPU的新型FP4张量核心加速注意力计算。在RTX5090上实现1038 TOPS算力，较该平台最快的FlashAttention提速5倍。实验表明，所提出的FP4注意力能以即插即用方式加速各类模型推理。其次，我们首次将低位宽注意力应用于训练任务。现有低位宽注意力研究（如FlashAttention3和SageAttention）仅聚焦推理场景，但大模型训练效率同样至关重要。为探究低位宽注意力在训练中的适用性，我们设计了一种精确高效的8位注意力算法，同时支持前向传播与反向传播。实验显示，8位注意力在微调任务中可实现无损性能，但在预训练任务中收敛速度较慢。代码将在https://github.com/thu-ml/SageAttention开源。

The Ripple Effect: On Unforeseen Complications of Backdoor Attacks

Abstract

arXiv:2505.11586v1 Announce Type: cross Abstract: Recent research highlights concerns about the trustworthiness of third-party Pre-Trained Language Models (PTLMs) due to potential backdoor attacks. These backdoored PTLMs, however, are effective only for specific pre-defined downstream tasks. In reality, these PTLMs can be adapted to many other unrelated downstream tasks. Such adaptation may lead to unforeseen consequences in downstream model outputs, consequently raising user suspicion and compromising attack stealthiness. We refer to this phenomenon as backdoor complications. In this paper, we undertake the first comprehensive quantification of backdoor complications. Through extensive experiments using 4 prominent PTLMs and 16 text classification benchmark datasets, we demonstrate the widespread presence of backdoor complications in downstream models fine-tuned from backdoored PTLMs. The output distribution of triggered samples significantly deviates from that of clean samples. Consequently, we propose a backdoor complication reduction method leveraging multi-task learning to mitigate complications without prior knowledge of downstream tasks. The experimental results demonstrate that our proposed method can effectively reduce complications while maintaining the efficacy and consistency of backdoor attacks. Our code is available at https://github.com/zhangrui4041/Backdoor_Complications.

摘要

近期研究指出，由于潜在的后门攻击风险，第三方预训练语言模型（PTLMs）的可信度引发关注。然而这些被植入后门的PTLMs仅对特定预定义的下游任务有效。现实中，这些PTLMs可被适配至许多其他无关的下游任务，此类适配可能导致下游模型输出出现不可预见的后果，从而引起用户怀疑并破坏攻击隐蔽性。我们将此现象称为后门并发症。本文首次对后门并发症进行全面量化研究，通过使用4个主流PTLMs和16个文本分类基准数据集进行大量实验，证明基于后门PTLMs微调的下游模型中普遍存在后门并发症。触发样本的输出分布与干净样本存在显著偏差。为此，我们提出一种基于多任务学习的后门并发症缓解方法，该方法无需预知下游任务即可降低并发症影响。实验结果表明，所提方法能有效减少并发症，同时保持后门攻击的效力与一致性。代码已开源：https://github.com/zhangrui4041/Backdoor_Complications。

Concept-Guided Interpretability via Neural Chunking

Abstract

arXiv:2505.11576v1 Announce Type: cross Abstract: Neural networks are often black boxes, reflecting the significant challenge of understanding their internal workings. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage cognitively-inspired methods of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract these emerging entities, complementing each other based on label availability and dimensionality. Discrete sequence chunking (DSC) creates a dictionary of entities; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting entities across varying model sizes, ranging from inducing compositionality in RNNs to uncovering recurring neural population states in large models with diverse architectures, and illustrate their advantage over other methods. Throughout, we observe a robust correspondence between the extracted entities and concrete or abstract concepts. Artificially inducing the extracted entities in neural populations effectively alters the network's generation of associated concepts. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.

摘要

神经网络常被视为黑箱，这反映了理解其内部运作机制的重大挑战。我们提出一个挑战主流观点的新视角：神经网络并非不可解读，其原始群体活动中展现出的模式实际上反映了训练数据中的规律性。我们将此称为"反射假说"，并在简单循环神经网络（RNN）和复杂大语言模型（LLM）中均发现了支持该现象的证据。基于这一发现，我们提出利用认知启发的组块化方法，将高维神经群体动力学分割为反映底层概念的可解释单元。我们开发了三种互补的实体提取方法：离散序列组块化（DSC）建立实体词典；群体平均法（PA）提取对应已知标签的重复实体；无监督组块发现（UCD）适用于无标签场景。这些方法在不同规模模型中均展现出卓越的实体提取能力——从在RNN中诱导组合性，到在多样化架构的大模型中发掘重复出现的神经群体状态，并显示出相对于其他方法的优势。所有实验均观察到提取实体与具体/抽象概念间的强对应关系。通过人工诱导神经群体中的提取实体，可有效改变网络对相关概念的生成。本研究开辟了可解释性的新路径：通过结合认知原理与自然数据的内在结构，逐步揭示复杂学习系统的隐藏计算机制，使其从黑箱转变为可理解的系统。

Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations

Abstract

arXiv:2505.11615v1 Announce Type: cross Abstract: Changing the behavior of large language models (LLMs) can be as straightforward as editing the Transformer's residual streams using appropriately constructed "steering vectors." These modifications to internal neural activations, a form of representation engineering, offer an effective and targeted means of influencing model behavior without retraining or fine-tuning the model. But how can such steering vectors be systematically identified? We propose a principled approach for uncovering steering vectors by aligning latent representations elicited through behavioral methods (specifically, Markov chain Monte Carlo with LLMs) with their neural counterparts. To evaluate this approach, we focus on extracting latent risk preferences from LLMs and steering their risk-related outputs using the aligned representations as steering vectors. We show that the resulting steering vectors successfully and reliably modulate LLM outputs in line with the targeted behavior.

摘要

改变大型语言模型（LLM）的行为可以像使用适当构建的"引导向量"编辑Transformer的残差流那样直接。这种对内部神经激活的修改属于表征工程的一种形式，为影响模型行为提供了一种有效且针对性的方法，而无需重新训练或微调模型。但如何系统地识别此类引导向量？我们提出了一种原则性方法，通过将行为方法（特别是基于LLM的马尔可夫链蒙特卡洛）引发的潜在表征与其神经对应部分对齐，来发现引导向量。为评估该方法，我们专注于从LLM中提取潜在风险偏好，并利用对齐后的表征作为引导向量来调控模型的风险相关输出。实验表明，所得引导向量能成功且可靠地按照目标行为调节LLM的输出。

Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO

Abstract

arXiv:2505.11595v1 Announce Type: cross Abstract: Reinforcement learning (RL) has demonstrated significant success in enhancing reasoning capabilities in large language models (LLMs). One of the most widely used RL methods is Group Relative Policy Optimization (GRPO) (Shao et al., 2024), known for its memory efficiency and success in training DeepSeek-R1 (Guo et al., 2025). However, GRPO stalls when all sampled responses in a group are incorrect, referred to as an all-negative-sample group, because it fails to update the policy, hindering learning progress. The contributions of this paper are two-fold. First, we propose a simple yet effective framework that introduces response diversity within all-negative-sample groups in GRPO using AI feedback. We also provide a theoretical analysis, via a stylized model, showing how this diversification improves learning dynamics. Second, we empirically validate our approach, showing the improved performance across various model sizes (7B, 14B, 32B) in both offline and online learning settings with 10 benchmarks, including base and distilled variants. Our findings highlight that learning from all-negative-sample groups is not only feasible but beneficial, advancing recent insights from Xiong et al. (2025).

摘要

强化学习（RL）在提升大语言模型（LLMs）的推理能力方面已展现出显著成效。其中，组相对策略优化（GRPO）是最广泛使用的RL方法之一，因其内存高效性及在训练DeepSeek-R1中的成功而闻名。然而，当组内所有采样响应均错误时（称为“全负样本组”），GRPO会停滞，因其无法更新策略，从而阻碍学习进程。本文的贡献有两点：首先，我们提出一个简单而有效的框架，通过AI反馈在全负样本组中引入响应多样性，并通过理论分析（基于简化模型）阐明这种多样化如何改善学习动态；其次，我们在10个基准测试（包括基础版和蒸馏版）中实证验证了该方法，展示了其在离线与在线学习环境下对不同规模模型（7B、14B、32B）性能的提升。研究结果表明，从全负样本组中学习不仅可行且有益，这推进了Xiong等人（2025）的近期见解。
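
The stall described in this abstract follows from how group-relative advantages are computed. The sketch below is a minimal illustration, assuming binary correctness rewards and normalization by the group's mean and population standard deviation; the function name and the `eps` constant are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed_group = [1.0, 0.0, 0.0, 1.0]   # some sampled responses are correct
all_negative = [0.0, 0.0, 0.0, 0.0]  # every sampled response is incorrect

# Correct responses get positive advantages, incorrect ones negative.
print(group_relative_advantages(mixed_group))
# Every advantage is zero, so the policy gradient carries no signal.
print(group_relative_advantages(all_negative))
```

Because every response in an all-negative group receives the same reward, each advantage is zero and the update vanishes. Giving those responses distinct scores, as the paper does with AI feedback, restores a non-zero learning signal.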

Chatting with Papers: A Hybrid Approach Using LLMs and Knowledge Graphs

Abstract

arXiv:2505.11633v1 Announce Type: cross Abstract: This demo paper reports on a new workflow, GhostWriter, that combines the use of Large Language Models and Knowledge Graphs (semantic artifacts) to support navigation through collections. Situated in the research area of Retrieval Augmented Generation, this specific workflow details the creation of local and adaptable chatbots. Based on the tool-suite EverythingData at the backend, GhostWriter provides an interface that enables querying and "chatting" with a collection. Applied iteratively, the workflow supports the information needs of researchers when interacting with a collection of papers, whether to gain an overview or to learn more about a specific concept and its context, and ultimately helps researchers refine their research question in a controlled way. We demonstrate the workflow for a collection of articles from the method data analysis journal published by GESIS -- Leibniz-Institute for the Social Sciences. We also point to further application areas.

摘要

本演示论文报告了一种名为《GhostWriter》的新型工作流程，该流程结合大型语言模型与知识图谱（语义构件）来支持文献集的导航研究。作为检索增强生成领域的具体应用，该工作流程详细阐述了本地化可适配聊天机器人的构建方法。基于后端工具套件《EverythingData》，《GhostWriter》提供了可对文献集进行查询和"对话"的交互界面。通过迭代应用，该流程能有效满足研究人员与论文集合交互时的信息需求——无论是获取领域概览、深入了解特定概念及其语境，还是帮助研究者以可控方式最终完善研究问题。我们以GESIS-莱布尼茨社会科学研究所出版的《method data analysis》期刊文章合集为例进行演示，并指出其更广泛的应用场景。

Multilingual Prompt Engineering in Large Language Models: A Survey Across NLP Tasks

Abstract

arXiv:2505.11665v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated impressive performance across a wide range of Natural Language Processing (NLP) tasks. However, ensuring their effectiveness across multiple languages presents unique challenges. Multilingual prompt engineering has emerged as a key approach to enhance LLMs' capabilities in diverse linguistic settings without requiring extensive parameter re-training or fine-tuning. With growing interest in multilingual prompt engineering over the past two to three years, researchers have explored various strategies to improve LLMs' performance across languages and NLP tasks. By crafting structured natural language prompts, researchers have successfully extracted knowledge from LLMs across different languages, making these techniques an accessible pathway for a broader audience, including those without deep expertise in machine learning, to harness the capabilities of LLMs. In this paper, we survey and categorize different multilingual prompting techniques based on the NLP tasks they address across a diverse set of datasets that collectively span around 250 languages. We further highlight the LLMs employed, present a taxonomy of approaches and discuss potential state-of-the-art (SoTA) methods for specific multilingual datasets. Additionally, we derive a range of insights across language families and resource levels (high-resource vs. low-resource), including analyses such as the distribution of NLP tasks by language resource type and the frequency of prompting methods across different language families. Our survey reviews 36 research papers covering 39 prompting techniques applied to 30 multilingual NLP tasks, with the majority of these studies published in the last two years.

摘要

大型语言模型（LLM）在自然语言处理（NLP）各项任务中展现出卓越性能，但确保其跨语言有效性仍面临独特挑战。多语言提示工程已成为增强LLM在多样化语言环境中能力的关键方法，无需进行大量参数重训练或微调。随着过去两三年间多语言提示工程研究兴趣的增长，学者们探索了多种策略以提升LLM跨语言及跨NLP任务的性能。通过构建结构化自然语言提示，研究者已成功从不同语言的LLM中提取知识，使得这些技术成为包括非机器学习专家的更广泛群体利用LLM能力的可行途径。本文基于涵盖约250种语言的多样化数据集，针对其所处理的NLP任务，对多语言提示技术进行了系统梳理与分类。我们进一步列举了所采用的LLM模型，提出方法分类体系，并探讨了特定多语言数据集的潜在前沿（SoTA）方法。此外，我们从语系和资源水平（高资源vs低资源）维度得出系列洞见，包括按语言资源类型划分的NLP任务分布分析，以及不同语系间提示方法使用频率的统计。本综述涵盖36篇研究论文，涉及应用于30项多语言NLP任务的39种提示技术，其中大部分研究发表于近两年内。

Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization

Abstract

arXiv:2505.11695v1 Announce Type: cross Abstract: We introduce Qronos -- a new state-of-the-art post-training quantization algorithm that sequentially rounds and updates neural network weights. Qronos not only explicitly corrects errors due to both weight and activation quantization, but also errors resulting from quantizing previous layers. Our iterative algorithm is based on an interpretable and disciplined optimization framework that subsumes and surpasses existing data-driven approaches. At each step, Qronos alternates between error correction and diffusion via optimal update rules. Importantly, we prove that Qronos admits an efficient implementation that uses the Cholesky decomposition for solving least-squares problems. We also demonstrate that Qronos is compatible with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization, among others. We evaluate Qronos using recent autoregressive language generation models in the Llama3 family; Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing the weights, activations, and/or KV caches.

摘要

我们提出Qronos——一种最先进的训练后量化算法，通过顺序舍入和更新神经网络权重实现优化。该算法不仅能显式修正权重和激活量化导致的误差，还能纠正前层量化引入的误差。我们的迭代算法基于可解释且严谨的优化框架，该框架涵盖并超越了现有数据驱动方法。在每一步迭代中，Qronos通过最优更新规则交替执行误差校正与扩散过程。值得注意的是，我们证明了Qronos可采用Cholesky分解高效求解最小二乘问题。实验表明，Qronos与现有变换技术（如基于Hadamard的非相干处理和权重-激活缩放均衡等）具有良好兼容性。我们在Llama3系列自回归语言生成模型上评估Qronos，结果表明：在对权重、激活和/或KV缓存进行量化时，Qronos始终优于先前最先进的自适应舍入方法。

Abstract

arXiv:2505.11717v1 Announce Type: cross Abstract: Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. Environmental prompt injection attacks manipulate the environment to induce the web agent to perform a specific, attacker-chosen action--referred to as the target action. However, existing attacks suffer from limited effectiveness or stealthiness, or are impractical in real-world settings. In this work, we propose EnvInjection, a new attack that addresses these limitations. Our attack adds a perturbation to the raw pixel values of the rendered webpage, which can be implemented by modifying the webpage's source code. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the target action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple webpage datasets shows that EnvInjection is highly effective and significantly outperforms existing baselines.

摘要

基于多模态大语言模型（MLLM）的网络代理通过与网页环境的截图交互，生成操作指令。环境提示注入攻击通过操纵环境，诱导网络代理执行攻击者选定的特定操作（称为目标操作）。然而，现有攻击方法在有效性或隐蔽性方面存在不足，或在实际场景中难以实施。本研究提出EnvInjection攻击方法以解决这些局限性。该攻击通过在渲染网页的原始像素值中添加扰动（可通过修改网页源代码实现），当这些扰动像素映射至截图后，即可诱导网络代理执行目标操作。我们将寻找最优扰动的问题建模为优化问题，其核心挑战在于原始像素值与截图之间的映射关系不可微分，导致难以通过反向传播梯度更新扰动。为此，我们训练神经网络以近似该映射关系，并采用投影梯度下降法求解重构后的优化问题。在多组网页数据集上的实验表明，EnvInjection具有显著的高效性，其性能远超现有基线方法。
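
The optimization step named in this abstract, projected gradient descent, can be sketched on a toy problem. The example assumes a simple differentiable squared-error loss in place of the paper's trained neural surrogate, and an L-infinity bound `eps` on the perturbation; all names and values here are hypothetical.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def squared_error(pixels, target):
    """Toy attack loss: distance between the current and the desired pixels."""
    return sum((p - t) ** 2 for p, t in zip(pixels, target))


def pgd_linf(pixels, target, eps=0.1, step=0.02, iters=50):
    """Find a perturbation within an L-infinity ball that lowers the toy loss."""
    delta = [0.0] * len(pixels)
    for _ in range(iters):
        # Gradient of the squared error with respect to the perturbation.
        grad = [2.0 * (p + d - t) for p, d, t in zip(pixels, delta, target)]
        # Signed gradient step toward lower loss.
        delta = [d - step * sign(g) for d, g in zip(delta, grad)]
        # Projection: clip each coordinate back into [-eps, eps].
        delta = [max(-eps, min(eps, d)) for d in delta]
    return delta


pixels = [0.5, 0.2, 0.9]
target = [1.0, 0.2, 0.0]
delta = pgd_linf(pixels, target)
perturbed = [p + d for p, d in zip(pixels, delta)]
print(delta, squared_error(pixels, target), squared_error(perturbed, target))
```

The projection is what keeps the perturbation small enough to stay stealthy while the gradient steps push the output toward the attacker's target.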

Token-Level Uncertainty Estimation for Large Language Model Reasoning

Abstract

arXiv:2505.11737v1 Announce Type: cross Abstract: While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a token-level uncertainty estimation framework to enable LLMs to self-assess and self-improve their generation quality in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation to LLM decoding, generating predictive distributions that we use to estimate token-level uncertainties. We then aggregate these uncertainties to reflect semantic uncertainty of the generated sequences. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that our token-level uncertainty metrics strongly correlate with answer correctness and model robustness. Additionally, we explore using uncertainty to directly enhance the model's reasoning performance through multiple generations and the particle filtering algorithm. Our approach consistently outperforms existing uncertainty estimation methods, establishing effective uncertainty estimation as a valuable tool for both evaluating and improving reasoning generation in LLMs.

摘要

尽管大型语言模型（LLMs）已展现出卓越的能力，但其输出质量在不同应用场景中仍存在波动，这使得识别可信响应（尤其在需要多步推理的复杂任务中）变得困难。本文提出一种基于词元级不确定性估计的框架，使LLMs能够在数学推理任务中实现生成质量的自我评估与自我提升。具体而言，我们在LLM解码过程中引入低秩随机权重扰动，生成用于估计词元级不确定性的预测分布，并通过聚合这些不确定性来反映生成序列的语义不确定性。在不同难度的数学推理数据集上的实验表明，我们的词元级不确定性指标与答案正确性及模型鲁棒性呈现强相关性。此外，我们探索了通过多重生成和粒子滤波算法直接利用不确定性提升模型推理性能的方法。本方法在各项实验中均优于现有不确定性估计技术，证实了有效的不确定性估计可作为评估和改进LLMs推理生成的双重工具。
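
One common way to turn several predictive distributions for the same token into uncertainty scores is the entropy decomposition sketched below. It assumes each weight perturbation yields one next-token distribution and uses the standard split of total entropy into an expected-entropy part and a disagreement part; whether the paper uses exactly these quantities is an assumption, and the function names are illustrative.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def token_uncertainty(sampled_dists):
    """Return (total, aleatoric, epistemic) uncertainty for one token position."""
    n = len(sampled_dists)
    mean_dist = [sum(col) / n for col in zip(*sampled_dists)]
    total = entropy(mean_dist)                             # entropy of the averaged prediction
    aleatoric = sum(entropy(d) for d in sampled_dists) / n  # average per-sample entropy
    epistemic = total - aleatoric                          # disagreement between samples
    return total, aleatoric, epistemic


agreeing = [[0.7, 0.3], [0.7, 0.3]]     # perturbed models agree
disagreeing = [[0.9, 0.1], [0.1, 0.9]]  # perturbed models disagree
print(token_uncertainty(agreeing))
print(token_uncertainty(disagreeing))
```

A sequence-level score could then be formed by averaging the per-token values, which is one simple form of the aggregation the abstract mentions.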

Efficient Uncertainty Estimation via Distillation of Bayesian Large Language Models

Abstract

arXiv:2505.11731v1 Announce Type: cross Abstract: Recent advances in uncertainty estimation for Large Language Models (LLMs) during downstream adaptation have addressed key challenges of reliability and simplicity. However, existing Bayesian methods typically require multiple sampling iterations during inference, creating significant efficiency issues that limit practical deployment. In this paper, we investigate the possibility of eliminating the need for test-time sampling for LLM uncertainty estimation. Specifically, when given an off-the-shelf Bayesian LLM, we distill its aligned confidence into a non-Bayesian student LLM by minimizing the divergence between their predictive distributions. Unlike typical calibration methods, our distillation is carried out solely on the training dataset without the need for an additional validation dataset. This simple yet effective approach achieves N-times more efficient uncertainty estimation during testing, where N is the number of samples traditionally required by Bayesian LLMs. Our extensive experiments demonstrate that uncertainty estimation capabilities on training data can successfully generalize to unseen test data through our distillation technique, consistently producing results comparable to (or even better than) state-of-the-art Bayesian LLMs.

摘要

近期在大语言模型（LLM）下游适配的不确定性估计方面取得的进展，已针对可靠性与简洁性的关键挑战提出了解决方案。然而，现有贝叶斯方法通常需要在推理阶段进行多次采样迭代，导致显著的效率问题，限制了实际部署。本文探讨了在LLM不确定性估计中消除测试阶段采样需求的可行性。具体而言，当给定一个现成的贝叶斯LLM时，我们通过最小化其预测分布之间的差异，将其校准后的置信度蒸馏至一个非贝叶斯学生LLM中。与典型校准方法不同，我们的蒸馏过程仅需训练数据集，无需额外验证集。这种简洁而高效的方法在测试阶段实现了N倍效率提升的不确定性估计（N为传统贝叶斯LLM所需的采样次数）。大量实验表明，通过我们的蒸馏技术，训练数据上的不确定性估计能力能成功泛化至未见测试数据，持续产生与最先进贝叶斯LLM相当（甚至更优）的结果。
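
The distillation objective described above can be illustrated with a small numeric sketch. It assumes the Bayesian teacher's prediction is the average of N sampled distributions and that the divergence is KL(teacher || student); the abstract does not name the divergence, so that choice and all names and numbers below are hypothetical.

```python
import math


def average_distribution(samples):
    """Bayesian model average: the mean of N sampled predictive distributions."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Teacher: N = 4 forward passes with sampled weights (costly at test time).
teacher_samples = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.80, 0.10, 0.10],
    [0.50, 0.30, 0.20],
]
teacher = average_distribution(teacher_samples)

# Student: a single forward pass whose output is trained to match the teacher.
student = [0.90, 0.05, 0.05]
loss = kl_divergence(teacher, student)
print(teacher, loss)
```

Once the student matches the teacher's averaged distribution, one forward pass replaces the N sampled passes, which is where the N-times efficiency gain comes from.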

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

Abstract

arXiv:2505.11756v1 Announce Type: cross Abstract: It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is narrower than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Our work shows there remain fundamental issues with SAEs, but we are hopeful that highlighting feature hedging will catalyze future advances that allow SAEs to achieve their full potential of interpreting LLMs at scale.

摘要

假设稀疏自编码器（SAEs）能够将多义性激活分解为可解释的线性方向，前提是这些激活由底层特征的稀疏线性组合构成。然而，我们发现当SAE的宽度小于训练数据中"真实特征"的数量且特征间存在相关性时，SAE会将相关特征的成分合并，从而破坏单义性。在LLM的SAEs中，这两个条件几乎必然成立。这种现象——我们称之为特征对冲——由SAE的重构损失引起，且SAE越窄，问题越严重。本研究首次提出特征对冲问题，并通过玩具模型的理论分析和LLM上训练的SAEs实证研究进行探讨。我们推测特征对冲可能是SAEs持续表现逊于监督基线的核心原因之一。最后，基于对特征对冲的理解，我们提出改进型的套娃SAE变体。这项工作表明SAEs仍存在根本性问题，但我们相信揭示特征对冲现象将推动未来技术进步，使SAEs最终实现大规模解释LLMs的全部潜力。

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Abstract

arXiv:2505.11770v1 Announce Type: cross Abstract: Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks--including symbol manipulation, knowledge retrieval, and instruction following--we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.

摘要

当前的可解释性研究提供了多种识别神经网络中抽象内部机制的技术。这些技术能否用于预测模型在分布外样本上的行为？本研究对该问题给出了肯定答案。通过一系列多样化的语言建模任务（包括符号操作、知识检索和指令遵循），我们发现最稳健的正确性预测特征正是那些在模型行为中起独特因果作用的特征。具体而言，我们提出两种利用因果机制预测模型输出正确性的方法：反事实模拟（检验关键因果变量是否实现）和数值探测（利用这些变量的值进行预测）。两种方法在分布内均取得高AUC-ROC值，且在分布外场景下显著优于依赖因果无关特征的方法——而预测模型行为在分布外场景中更为关键。因此，我们的工作揭示了语言模型内部因果分析的一项新颖且重要的应用价值。

Token Masking Improves Transformer-Based Text Classification

Abstract

arXiv:2505.11746v1 Announce Type: cross Abstract: While transformer-based models achieve strong performance on text classification, we explore whether masking input tokens can further enhance their effectiveness. We propose token masking regularization, a simple yet theoretically motivated method that randomly replaces input tokens with a special [MASK] token with probability p. This introduces stochastic perturbations during training, leading to implicit gradient averaging that encourages the model to capture deeper inter-token dependencies. Experiments on language identification and sentiment analysis -- across diverse models (mBERT, Qwen2.5-0.5B, TinyLlama-1.1B) -- show consistent improvements over standard regularization techniques. We identify task-specific optimal masking rates, with p = 0.1 as a strong general default. We attribute the gains to two key effects: (1) input perturbation reduces overfitting, and (2) gradient-level smoothing acts as implicit ensembling.

摘要

尽管基于Transformer的模型在文本分类任务中表现出色，本研究探讨了掩码输入标记是否能进一步提升其效能。我们提出了一种简单但具有理论依据的标记掩码正则化方法，该方法以概率p随机将输入标记替换为特殊的[MASK]标记。这种技术在训练过程中引入了随机扰动，通过隐式梯度平均机制促使模型学习更深层次的标记间依赖关系。在语言识别和情感分析任务上的实验（涵盖多种模型：mBERT、Qwen2.5-0.5B、TinyLlama-1.1B）表明，该方法相较于标准正则化技术实现了稳定提升。研究发现不同任务存在特定的最优掩码率，其中p=0.1可作为普适性较强的默认值。性能提升主要归因于两个关键效应：(1) 输入扰动有效抑制过拟合，(2) 梯度层面的平滑作用实现了隐式集成效果。
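
The masking rule in this abstract is simple enough to sketch directly. The example works on a list of string tokens and replaces each one independently with `[MASK]` with probability `p`; the function name `mask_tokens` and the optional `rng` argument are illustrative, and a real pipeline would operate on token IDs inside the training data loader.

```python
import random


def mask_tokens(tokens, p=0.1, mask_token="[MASK]", rng=None):
    """Independently replace each token with mask_token with probability p."""
    rng = rng or random.Random()
    return [mask_token if rng.random() < p else tok for tok in tokens]


sentence = "the film was surprisingly good".split()
# A fresh random mask is drawn each time, so every epoch sees a different view.
for epoch in range(3):
    print(mask_tokens(sentence, p=0.1, rng=random.Random(epoch)))
```

Masking would be applied only during training, with the default p = 0.1 suggested in the abstract; at evaluation time the inputs are left unchanged.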

Towards Universal Semantics With Large Language Models

Abstract

arXiv:2505.11764v1 Announce Type: cross Abstract: The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond.

摘要

自然语义元语言（NSM）是一种基于普遍语义基元的语言学理论，这些语义基元是简单、原始的词汇意义，已被证实在全球绝大多数（若非全部）语言中都存在。根据该框架，任何词汇无论复杂度如何，均可通过这些基元进行释义，从而揭示清晰且具普遍可译性的含义。此类释义（称为"语义解析"）可为众多自然语言处理（NLP）任务提供重要应用，但传统生成过程缓慢且依赖人工。本研究首次探讨利用大语言模型（LLM）生成NSM语义解析的方法，提出了自动评估方案、专用于训练与评估的数据集，以及针对该任务优化的微调模型。我们的10亿和80亿参数模型在生成准确、具跨语言可译性的语义解析方面优于GPT-4o，标志着LLM实现通用语义表征的重要进展，为语义分析、翻译等应用开辟了新途径。

ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training

Abstract

arXiv:2505.11739v1 Announce Type: cross Abstract: Recently, training-free methods for improving large language models (LLMs) have attracted growing interest, with token-level attention tuning emerging as a promising and interpretable direction. However, existing methods typically rely on auxiliary mechanisms to identify important or irrelevant task-specific tokens, introducing potential bias and limiting applicability. In this paper, we uncover a surprising and elegant alternative: the semantically empty initial token is a powerful and underexplored control point for optimizing model behavior. Through theoretical analysis, we show that tuning the initial token's attention sharpens or flattens the attention distribution over subsequent tokens, and its role as an attention sink amplifies this effect. Empirically, we find that: (1) tuning its attention improves LLM performance more effectively than tuning other task-specific tokens; (2) the effect follows a consistent trend across layers, with earlier layers having greater impact, but varies across attention heads, with different heads showing distinct preferences in how they attend to this token. Based on these findings, we propose ZeroTuning, a training-free approach that improves LLM performance by applying head-specific attention adjustments to this special token. Despite tuning only one token, ZeroTuning achieves higher performance on text classification, multiple-choice, and multi-turn conversation tasks across models such as Llama, Qwen, and DeepSeek. For example, ZeroTuning improves Llama-3.1-8B by 11.71% on classification, 2.64% on QA tasks, and raises its multi-turn score from 7.804 to 7.966. The method is also robust to limited resources, few-shot settings, long contexts, quantization, decoding strategies, and prompt variations. Our work sheds light on a previously overlooked control point in LLMs, offering new insights into both inference-time tuning and model interpretability.

摘要

近期，提升大语言模型（LLM）性能的无训练方法日益受到关注，其中基于词元级注意力调节的技术因其可解释性成为重要研究方向。然而现有方法通常依赖辅助机制识别任务相关重要或无关词元，可能引入偏差且适用性有限。本文揭示了一种简洁而高效的替代方案：语义空缺的初始词元作为模型行为优化的控制点具有未被充分挖掘的潜力。理论分析表明，调节初始词元的注意力会锐化或平滑后续词元的注意力分布，其作为"注意力汇聚点"的特性可放大该效应。实验发现：（1）调节初始词元注意力比调节任务相关词元更能有效提升模型性能；（2）该效应在模型各层呈现一致性（浅层影响更大），但在不同注意力头中表现各异（各头对该词元的关注偏好不同）。基于此，我们提出ZeroTuning方法——通过对该特殊词元实施头部特异性注意力调节来实现无训练的LLM性能提升。尽管仅调节单个词元，ZeroTuning在Llama、Qwen和DeepSeek等模型的文本分类、多选问答及多轮对话任务中均取得更优表现。例如Llama-3.1-8B模型在分类任务上提升11.71%，QA任务提升2.64%，多轮对话评分从7.804升至7.966。该方法对资源限制、少样本场景、长上下文、量化处理、解码策略及提示词变化均表现鲁棒。本研究为LLM中一个长期被忽视的控制点提供了新见解，对推理时调优和模型可解释性研究具有双重启示意义。
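
The sharpening and flattening effect mentioned in this abstract can be seen in a toy calculation. The sketch assumes the adjustment multiplies the initial token's attention weight by a factor `gamma` and then renormalizes the row; this is one simple reading of the abstract, not the authors' head-specific procedure, and the weights are made up.

```python
def scale_initial_token(attn, gamma):
    """Scale the first token's attention weight by gamma, then renormalize."""
    scaled = [attn[0] * gamma] + list(attn[1:])
    total = sum(scaled)
    return [w / total for w in scaled]


attn = [0.6, 0.3, 0.1]  # index 0 is the initial (attention sink) token
flatter = scale_initial_token(attn, 2.0)  # more weight on the initial token
sharper = scale_initial_token(attn, 0.5)  # less weight on the initial token

# Gap between the two later tokens before and after the adjustment.
print(attn[1] - attn[2], flatter[1] - flatter[2], sharper[1] - sharper[2])
```

Raising the initial token's weight shrinks the absolute differences among the remaining tokens, while lowering it enlarges them. Because the initial token usually holds a large share of attention, a small change to it moves the rest of the row noticeably.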

Retrospex: Language Agent Meets Offline Reinforcement Learning Critic

Abstract

arXiv:2505.11807v1 Announce Type: cross Abstract: Large Language Models (LLMs) possess extensive knowledge and commonsense reasoning capabilities, making them valuable for creating powerful agents. However, existing LLM agent frameworks have not fully utilized past experiences for improvement. This work introduces a new LLM-based agent framework called Retrospex, which addresses this challenge by analyzing past experiences in depth. Unlike previous approaches, Retrospex does not directly integrate experiences into the LLM's context. Instead, it combines the LLM's action likelihood with action values estimated by a Reinforcement Learning (RL) Critic, which is trained on past experiences through an offline "retrospection" process. Additionally, Retrospex employs a dynamic action rescoring mechanism that increases the importance of experience-based values for tasks that require more interaction with the environment. We evaluate Retrospex in ScienceWorld, ALFWorld and Webshop environments, demonstrating its advantages over strong, contemporary baselines.

摘要

大语言模型（LLMs）具备广泛的知识和常识推理能力，这使其成为构建强大智能体的重要基础。然而，现有LLM智能体框架未能充分利用历史经验进行自我改进。本研究提出了一种名为Retrospex的新型基于LLM的智能体框架，通过深度分析历史经验来解决这一挑战。与先前方法不同，Retrospex并不直接将经验整合到LLM的上下文中，而是将LLM的动作似然与强化学习（RL）评论家估算的动作价值相结合——该评论家通过离线'回溯'过程在历史经验上进行训练。此外，Retrospex采用动态动作重评分机制，对于需要更多环境交互的任务，会提升基于经验的价值权重。我们在ScienceWorld、ALFWorld和Webshop环境中对Retrospex进行了评估，结果表明其优于当前先进的基线方法。
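
The idea of combining an LLM's action likelihood with critic values can be sketched as a weighted blend. The example assumes both signals are turned into distributions with a softmax and that the critic's weight grows linearly with the number of interaction steps; that schedule, the function names, and the example numbers are hypothetical stand-ins for the paper's dynamic rescoring rule.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rescore_actions(llm_logprobs, critic_values, step, max_steps):
    """Blend LLM action likelihoods with critic action values."""
    actions = list(llm_logprobs)
    # Assumed schedule: rely more on the critic as the episode gets longer.
    weight = min(1.0, step / max_steps)
    p_llm = softmax([llm_logprobs[a] for a in actions])
    p_critic = softmax([critic_values[a] for a in actions])
    combined = {
        a: (1.0 - weight) * pl + weight * pc
        for a, pl, pc in zip(actions, p_llm, p_critic)
    }
    return max(combined, key=combined.get), combined


llm_logprobs = {"open door": -0.2, "take key": -1.8}  # the LLM prefers "open door"
critic_values = {"open door": 0.1, "take key": 0.9}   # the critic prefers "take key"
print(rescore_actions(llm_logprobs, critic_values, step=0, max_steps=10)[0])
print(rescore_actions(llm_logprobs, critic_values, step=10, max_steps=10)[0])
```

Early in an episode the LLM's preference dominates, and later the critic's experience-based values take over, which matches the abstract's description of raising the importance of experience for interaction-heavy tasks.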

HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

Abstract

arXiv:2505.11774v1 Announce Type: cross Abstract: Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present HARDMath2, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies to create increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as we enter an age where state-of-the-art language models can solve many challenging problems across a wide domain of fields.

摘要

大型语言模型（LLMs）在数学问题求解方面展现出显著进展，但现有评估主要集中于具有精确解析解或涉及形式化证明的问题，往往忽略了应用科学与工程中普遍存在的基于近似求解的问题。为填补这一空白，我们在前期工作基础上提出HARDMath2数据集——包含211道原创题目，涵盖研究生应用数学导论课程核心内容，包括边界层分析、WKB方法、非线性偏微分方程的渐近解以及振荡积分的渐近分析。该数据集由哈佛大学研究生应用数学核心课程的师生共同设计与验证。我们通过新型协作环境构建该数据集：要求学生根据课程大纲编写并优化难题，进行同伴验证解答，测试不同模型，并自动对比LLM生成解与学生答案及数值基准真值。评估结果表明，前沿领先模型仍难以解决数据集中多数问题，揭示了当前LLMs数学推理能力的缺陷。值得注意的是，学生通过模型交互并利用常见失效模式，总结出创建高难度问题的策略。这种与模型的往复互动不仅催生出更丰富、更具挑战性的基准测试，还促使学生对课程内容的理解获得质性提升——这一教育价值在当今时代尤为重要，因为最先进语言模型已能解决众多领域的大量难题。

CoT-Vid: Dynamic Chain-of-Thought Routing with Self Verification for Training-Free Video Reasoning

Abstract

arXiv:2505.11830v1 Announce Type: cross Abstract: System2 reasoning is developing rapidly with the emergence of Deep-Thinking Models and chain-of-thought technology, and it has become a central discussion point in the AI community. However, research on complex video reasoning still lags behind. In this work, we propose CoT-Vid, a novel training-free paradigm for the video domain with a multistage complex reasoning design. Unlike existing video LLMs, which rely heavily on perceptual abilities, it achieves a surprising performance gain through an explicit reasoning mechanism. The paradigm consists of three main components: dynamic inference path routing, a problem decoupling strategy, and video self-consistency verification. In addition, we propose a new standard for the categorization of video questions. CoT-Vid shows outstanding results on a wide range of benchmarks and outperforms its base model by 9.3% on Egochema and 5.6% on VideoEspresso, rivalling or even surpassing larger and proprietary models such as GPT-4V, GPT-4o and Gemini-1.5-flash. Our codebase will be publicly available soon.

摘要

随着深度思维模型和思维链技术的兴起，系统2推理近年来发展迅速，已成为人工智能领域的核心议题。然而当前针对复杂视频推理的研究仍存在相对空白。本研究提出CoT-Vid——一种面向视频领域的新型免训练范式，其多阶段复杂推理设计显著区别于现有视频大语言模型对感知能力的重度依赖，通过显式推理机制实现了惊人的性能提升。该范式包含三大核心组件：动态推理路径路由、问题解耦策略及视频自洽验证。此外，我们提出了视频问题分类的新标准。CoT-Vid在多项基准测试中表现优异，在Egochema上较基线模型提升9.3%，在VideoEspresso上提升5.6%，其性能可媲美甚至超越GPT-4V、GPT-4o和Gemini-1.5-flash等更大规模的专有模型。代码库即将公开。

Are vision language models robust to uncertain inputs?

Abstract

arXiv:2505.11804v1 Announce Type: cross Abstract: Robustness against uncertain and ambiguous inputs is a critical challenge for deep learning models. While recent advancements in large scale vision language models (VLMs, e.g. GPT4o) might suggest that increasing model and training dataset size would mitigate this issue, our empirical evaluation shows a more complicated picture. Testing models using two classic uncertainty quantification tasks, anomaly detection and classification under inherently ambiguous conditions, we find that newer and larger VLMs indeed exhibit improved robustness compared to earlier models, but still suffer from a tendency to strictly follow instructions, often causing them to hallucinate confident responses even when faced with unclear or anomalous inputs. Remarkably, for natural images such as ImageNet, this limitation can be overcome without pipeline modifications: simply prompting models to abstain from uncertain predictions enables significant reliability gains, achieving near-perfect robustness in several settings. However, for domain-specific tasks such as galaxy morphology classification, a lack of specialized knowledge prevents reliable uncertainty estimation. Finally, we propose a novel mechanism based on caption diversity to reveal a model's internal uncertainty, enabling practitioners to predict when models will successfully abstain without relying on labeled data.

摘要

针对不确定性和模糊输入的鲁棒性是深度学习模型面临的关键挑战。虽然近期大规模视觉语言模型（VLM，如GPT4o）的进展可能暗示增大模型和训练数据集规模能缓解该问题，但我们的实证评估揭示了更复杂的情况。通过使用异常检测和固有模糊条件下的分类这两个经典不确定性量化任务进行测试，我们发现较新的大型VLM相比早期模型确实表现出更强的鲁棒性，但仍存在严格遵循指令的倾向，这常导致其在面对不明确或异常输入时产生过度自信的幻觉响应。值得注意的是，对于ImageNet等自然图像，无需修改流程即可克服该局限：仅需提示模型对不确定预测保持回避，即可显著提升可靠性，在多种设置下实现近乎完美的鲁棒性。然而对于星系形态分类等特定领域任务，专业知识的缺乏会阻碍可靠的不确定性估计。最后，我们提出一种基于描述多样性的新机制来揭示模型内部不确定性，使实践者无需依赖标注数据即可预测模型何时能成功保持回避。
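One way to picture the caption-diversity idea from the abstract above is to sample several captions for the same image and measure how much they disagree. The sketch below is illustrative only: the word-level Jaccard distance, the function names and the averaging rule are assumptions made for this example, not the mechanism used in the paper.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A & B| / |A | B| over lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # two empty captions are treated as identical
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def caption_diversity(captions: list[str]) -> float:
    """Mean pairwise Jaccard distance; 0.0 means all captions agree."""
    pairs = list(combinations(captions, 2))
    if not pairs:
        return 0.0  # fewer than two captions: no disagreement to measure
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

A high score suggests the model describes the image inconsistently, which could serve as a label-free signal that it is uncertain. Any threshold for acting on that signal would have to be chosen empirically.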

Search-Based Correction of Reasoning Chains for Language Models

Abstract

arXiv:2505.11824v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we introduce a new self-correction framework that augments each reasoning step in a CoT with a latent variable indicating its veracity, enabling modeling of all possible truth assignments rather than assuming correctness throughout. To efficiently explore this expanded space, we introduce Search Corrector, a discrete search algorithm over boolean-valued veracity assignments. It efficiently performs otherwise intractable inference in the posterior distribution over veracity assignments by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time correction method facilitates supervised fine-tuning of an Amortized Corrector by providing pseudo-labels for veracity. The Amortized Corrector generalizes self-correction, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that Search Corrector reliably identifies errors in logical (ProntoQA) and mathematical reasoning (GSM8K) benchmarks. The Amortized Corrector achieves comparable zero-shot accuracy and improves final answer accuracy by up to 25%.

摘要

链式思考（CoT）推理技术提升了语言模型（LM）的能力与透明度，但其推理链中可能存在错误陈述，从而降低性能与可信度。为解决该问题，我们提出了一种新型自校正框架：通过为CoT的每个推理步骤引入表征真实性的隐变量，实现对所有可能真值分配的建模，而非默认全程正确。为高效探索这一扩展空间，我们开发了"搜索校正器"——一种基于布尔真值分配的离散搜索算法。该算法通过利用语言模型对真值与最终答案的联合似然作为代理奖励，在后验真值分配分布中实现了原本难以处理的高效推理。这种高效的推理时校正方法通过为真值分配提供伪标签，支持了"摊销校正器"的监督微调。摊销校正器可泛化自校正能力，实现新场景下的零样本真值推理。实验结果表明：搜索校正器在逻辑推理（ProntoQA）和数学推理（GSM8K）基准测试中能可靠识别错误；摊销校正器在零样本准确率上达到可比水平，并将最终答案准确率最高提升25%。
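The search over boolean veracity assignments described above can be pictured with a small local-search sketch. This is not the paper's Search Corrector: it is a plain greedy bit-flip search, and `proxy_reward` is a hypothetical stand-in for the LM's joint likelihood over veracity and final answer.

```python
from typing import Callable, Sequence


def local_search(
    n_steps: int,
    proxy_reward: Callable[[Sequence[bool]], float],
    max_rounds: int = 100,
) -> list[bool]:
    """Greedy single-bit-flip search over veracity assignments."""
    current = [True] * n_steps  # start by assuming every step is correct
    best = proxy_reward(current)
    for _ in range(max_rounds):
        improved = False
        for i in range(n_steps):
            candidate = current.copy()
            candidate[i] = not candidate[i]  # flip the veracity of step i
            score = proxy_reward(candidate)
            if score > best:  # keep the flip only if the proxy reward rises
                current, best, improved = candidate, score, True
        if not improved:
            break  # local optimum: no single flip helps
    return current


def toy_reward(assignment: Sequence[bool]) -> float:
    """Toy reward (assumption): agreement with a hidden ground-truth labelling."""
    hidden = [True, False, True, True]
    return float(sum(a == h for a, h in zip(assignment, hidden)))
```

With the toy reward, `local_search(4, toy_reward)` flags only the second step as incorrect. A real proxy reward would be far noisier, so a greedy search like this one could stall in a local optimum.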

On Membership Inference Attacks in Knowledge Distillation

Abstract

arXiv:2505.11837v1 Announce Type: cross Abstract: Nowadays, Large Language Models (LLMs) are trained on huge datasets, some including sensitive information. This poses a serious privacy concern because privacy attacks such as Membership Inference Attacks (MIAs) may detect this sensitive information. While knowledge distillation compresses LLMs into efficient, smaller student models, its impact on privacy remains underexplored. In this paper, we investigate how knowledge distillation affects model robustness against MIA. We focus on two questions. First, how is private data protected in teacher and student models? Second, how can we strengthen privacy preservation against MIAs in knowledge distillation? Through comprehensive experiments, we show that while teacher and student models achieve similar overall MIA accuracy, teacher models better protect member data, the primary target of MIA, whereas student models better protect non-member data. To address this vulnerability in student models, we propose 5 privacy-preserving distillation methods and demonstrate that they successfully reduce student models' vulnerability to MIA, with ensembling further stabilizing the robustness, offering a reliable approach for distilling more secure and efficient student models. Our implementation source code is available at https://github.com/richardcui18/MIA_in_KD.

摘要

当前，大规模语言模型（LLM）的训练依赖于包含敏感信息的海量数据集，这引发了严重的隐私担忧——成员推断攻击（MIA）等隐私攻击可能探测到此类敏感信息。尽管知识蒸馏技术能将LLM压缩为高效的小型学生模型，但其对隐私保护的影响尚未得到充分研究。本文系统探究了知识蒸馏如何影响模型抵御MIA的鲁棒性，重点解决两个核心问题：其一，师生模型中私有数据的保护机制有何差异？其二，如何增强知识蒸馏过程中对抗MIA的隐私保护能力？通过全面实验，我们发现虽然师生模型的总体MIA准确率相近，但教师模型能更好地保护MIA主要攻击目标（成员数据），而学生模型更擅长保护非成员数据。针对学生模型的这一脆弱性，我们提出五种隐私保护蒸馏方法，并证明其能有效降低学生模型受MIA攻击的风险，其中集成方法进一步提升了鲁棒性的稳定性，为蒸馏更安全高效的学生模型提供了可靠方案。项目源代码已开源：https://github.com/richardcui18/MIA_in_KD。

Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning

Abstract

arXiv:2505.11827v1 Announce Type: cross Abstract: Compressing long chain-of-thought (CoT) from large language models (LLMs) is an emerging strategy to improve the reasoning efficiency of LLMs. Despite its promising benefits, existing studies compress all thoughts within a long CoT equally, hindering more concise and effective reasoning. To this end, we first investigate the importance of different thoughts by examining their effectiveness and efficiency in contributing to reasoning through automatic long CoT chunking and Monte Carlo rollouts. Building on these insights, we propose a theoretically bounded metric to jointly measure the effectiveness and efficiency of different thoughts. We then propose Long $\otimes$ Short, an efficient reasoning framework that enables two LLMs to collaboratively solve the problem: a long-thought LLM that generates important thoughts more effectively, and a short-thought LLM that efficiently generates the remaining thoughts. Specifically, we begin by synthesizing a small amount of cold-start data to fine-tune LLMs for long-thought and short-thought reasoning styles, respectively. Furthermore, we propose a synergy-oriented multi-turn reinforcement learning scheme, focusing on model self-evolution and collaboration between the long-thought and short-thought LLMs. Experimental results show that our method enables Qwen2.5-7B and Llama3.1-8B to achieve performance comparable to DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, while reducing token length by over 80% across the MATH500, AIME24/25, AMC23, and GPQA Diamond benchmarks. Our data and code are available at https://github.com/yasNing/Long-otimes-Short/.

摘要

压缩大语言模型（LLMs）的长思维链（CoT）是提升其推理效率的新兴策略。尽管前景广阔，现有研究均等地压缩长CoT中的所有思维单元，限制了更简洁高效的推理实现。为此，我们首先通过自动长CoT分块和蒙特卡洛推演，探究不同思维单元对推理贡献的有效性与效率差异。基于这些发现，我们提出一个理论有界的联合度量指标，用以评估不同思维单元的有效性与效率。继而提出Long $\otimes$ Short框架——通过双LLM协同解题：长思维LLM专注生成关键思维单元，短思维LLM高效处理剩余单元。具体实现包括：1）合成少量冷启动数据分别微调长/短思维推理风格的LLM；2）设计面向协同进化的多轮强化学习机制，聚焦模型自我进化与长短思维LLM的协作。实验表明，本方法使Qwen2.5-7B和Llama3.1-8B在MATH500、AIME24/25、AMC23及GPQA Diamond基准测试中达到与DeepSeek-R1-Distill-Qwen-7B和DeepSeek-R1-Distill-Llama-8B相当的性能，同时减少超过80%的token消耗。数据与代码详见https://github.com/yasNing/Long-otimes-Short/。

SplInterp: Improving our Understanding and Training of Sparse Autoencoders

Abstract

arXiv:2505.11836v1 Announce Type: cross Abstract: Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework, we discover that SAEs generalise "$k$-means autoencoders" to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal "$k$-means-esque plus local principal component analysis (PCA)" piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. We also develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid theoretical foundations and promising empirical results in MNIST and LLM experiments, particularly in sample efficiency and (in the LLM setting) improved sparsity of codes. All code is available at: https://github.com/splInterp2025/splInterp

摘要

稀疏自编码器（SAEs）作为机制可解释性工具近期受到广泛关注，其在从超大规模语言模型中提取可解释特征方面展现出显著成效。然而，该领域研究主要基于实证，近期对SAEs实际效用的质疑逐渐显现。本研究基于深度学习样条理论，旨在深化对SAEs的理论认知。通过将该框架应用于SAEs，我们发现：SAEs将"k均值自编码器"推广为分段仿射形式，但与最优的"类k均值加局部主成分分析（PCA）"分段仿射自编码器相比，其以牺牲精度为代价换取可解释性。我们利用幂图刻画了（TopK）SAEs的底层几何结构，并提出新型近端交替随机梯度下降（PAM-SGD）算法用于SAEs训练——该算法不仅具有扎实的理论基础，在MNIST和LLM实验中更展现出优异的样本效率（在LLM场景下编码稀疏性显著提升）等实证表现。全部代码已开源：https://github.com/splInterp2025/splInterp

Multilingual Collaborative Defense for Large Language Models

Abstract

arXiv:2505.11835v1 Announce Type: cross Abstract: The robustness and security of large language models (LLMs) have become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of "jailbreaking" these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that automatically optimizes a continuous, soft safety prompt to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages. Second, MCD maintains strong generalization capabilities while minimizing false refusal rates. Third, MCD mitigates the language safety misalignment caused by imbalances in LLM training corpora. To evaluate the effectiveness of MCD, we manually construct multilingual versions of commonly used jailbreak benchmarks, such as MaliciousInstruct and AdvBench, to assess various safeguarding methods. Additionally, we introduce these datasets in underrepresented (zero-shot) languages to verify the language transferability of MCD. The results demonstrate that MCD outperforms existing approaches in safeguarding against multilingual jailbreak attempts while also exhibiting strong language transfer capabilities. Our code is available at https://github.com/HLiang-Lee/MCD.

摘要

大型语言模型（LLMs）的鲁棒性与安全性已成为重要研究领域。一个显著漏洞是通过将有害查询翻译为罕见或低资源语言来绕过LLM防护机制，这种简单却有效的"越狱"方法日益引发关注。然而针对多语言场景下LLM防护的研究仍显不足，突显了增强多语言安全性的迫切需求。本研究探究了不同语言间攻击特征的相关性，提出多语言协同防御（MCD）——一种通过自动优化连续软安全提示来实现LLM多语言防护的新型学习方法。MCD具有三大优势：首先显著提升多语言防护性能；其次在保持强泛化能力的同时降低误拒率；第三缓解由LLM训练语料不均衡导致的语言安全失准问题。为评估MCD效果，我们手动构建了MaliciousInstruct、AdvBench等常用越狱基准的多语言版本，并引入零样本低资源语言场景验证其语言迁移能力。实验表明MCD在防御多语言越狱攻击时优于现有方法，同时展现出强大的跨语言迁移能力。代码已开源：https://github.com/HLiang-Lee/MCD。

RLAP: A Reinforcement Learning Enhanced Adaptive Planning Framework for Multi-step NLP Task Solving

Abstract

arXiv:2505.11893v1 Announce Type: cross Abstract: Multi-step planning has been widely employed to enhance the performance of large language models (LLMs) on downstream natural language processing (NLP) tasks: it decomposes the original task into multiple subtasks and guides LLMs to solve them sequentially without additional training. When addressing task instances, existing methods either preset the order of steps or attempt multiple paths at each step. However, these methods overlook instances' linguistic features and rely on the intrinsic planning capabilities of LLMs to evaluate intermediate feedback and then select subtasks, resulting in suboptimal outcomes. To better solve multi-step NLP tasks with LLMs, in this paper we propose a Reinforcement Learning enhanced Adaptive Planning framework (RLAP). In our framework, we model an NLP task as a Markov decision process (MDP) and embed an LLM directly in the environment. In particular, a lightweight Actor model is trained through reinforcement learning to estimate Q-values for natural language sequences consisting of states and actions. Therefore, during sequential planning, the linguistic features of each sequence in the MDP can be taken into account, and the Actor model interacts with the LLM to determine the optimal order of subtasks for each task instance. We apply RLAP to three different types of NLP tasks and conduct extensive experiments on multiple datasets to verify RLAP's effectiveness and robustness.

摘要

多步规划已被广泛应用于提升大语言模型（LLM）在下游自然语言处理（NLP）任务中的表现，该方法将原始任务分解为多个子任务，并引导LLM在不额外训练的情况下依次解决。现有方法在处理任务实例时，要么预设步骤顺序，要么在每一步尝试多种路径。然而，这些方法忽视了实例的语言特征，仅依赖LLM固有的规划能力来评估中间反馈并选择子任务，导致结果欠佳。为更好地利用LLM解决多步NLP任务，本文提出一种强化学习增强的自适应规划框架（RLAP）。在该框架中，我们将NLP任务建模为马尔可夫决策过程（MDP），并将LLM直接嵌入环境。具体而言，通过强化学习训练一个轻量级Actor模型，用于估计由状态和动作组成的自然语言序列的Q值。因此，在顺序规划过程中，MDP中每个序列的语言特征均可被纳入考量，Actor模型与LLM交互以确定每个任务实例的最优子任务顺序。我们在三类不同的NLP任务上应用RLAP，并在多个数据集上进行大量实验，验证了RLAP的有效性和鲁棒性。

An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts

Abstract

arXiv:2505.11924v1 Announce Type: cross Abstract: We provide an explanation for the performance gains of intrinsic self-correction, a process where a language model iteratively refines its outputs without external feedback. More precisely, we investigate how prompting induces interpretable changes in hidden states and thus affects the output distributions. We hypothesize that each prompt-induced shift lies in a linear span of some linear representation vectors, naturally separating tokens based on individual concept alignment. Building around this idea, we give a mathematical formulation of self-correction and derive a concentration result for output tokens based on alignment magnitudes. Our experiments on text detoxification with zephyr-7b-sft reveal a substantial gap in the inner products of the prompt-induced shifts and the unembeddings of the top-100 most toxic tokens vs. those of the unembeddings of the bottom-100 least toxic tokens, under toxic instructions. This suggests that self-correction prompts enhance a language model's capability of latent concept recognition. Our analysis offers insights into the underlying mechanism of self-correction by characterizing how prompting works explainably. For reproducibility, our code is available.

摘要

我们针对语言模型内在自我校正（无需外部反馈即可迭代优化输出）的性能提升机制提出解释。具体而言，本研究探究提示如何引发隐藏状态的可解释性变化并影响输出分布。我们假设每个提示诱导的偏移量均位于某些线性表示向量的线性张成空间中，从而基于个体概念对齐实现词汇的自然区分。围绕该假设，我们建立自我校正的数学模型，并基于对齐强度推导出输出词汇的集中性结果。在zephyr-7b-sft模型上进行的文本脱毒实验表明：在毒性指令下，提示诱导偏移量与毒性最高的100个词汇的反嵌入向量的内积，和其与毒性最低的100个词汇的反嵌入向量的内积之间存在显著差距。这表明自我校正提示能增强语言模型对潜在概念的识别能力。通过可解释地刻画提示机制的工作原理，我们的分析揭示了自我校正的内在机理。为确保可复现性，相关代码已开源。

SafeVid: Toward Safety Aligned Video Large Multimodal Models

Abstract

arXiv:2505.11926v1 Announce Type: cross Abstract: As Video Large Multimodal Models (VLMMs) rapidly advance, their inherent complexity introduces significant safety challenges, particularly the issue of mismatched generalization where static safety alignments fail to transfer to dynamic video contexts. We introduce SafeVid, a framework designed to instill video-specific safety principles in VLMMs. SafeVid uniquely transfers robust textual safety alignment capabilities to the video domain by employing detailed textual video descriptions as an interpretive bridge, facilitating LLM-based rule-driven safety reasoning. This is achieved through a closed-loop system comprising: 1) generation of SafeVid-350K, a novel 350,000-pair video-specific safety preference dataset; 2) targeted alignment of VLMMs using Direct Preference Optimization (DPO); and 3) comprehensive evaluation via our new SafeVidBench benchmark. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements (e.g., up to 42.39%) on SafeVidBench. SafeVid provides critical resources and a structured approach, demonstrating that leveraging textual descriptions as a conduit for safety reasoning markedly improves the safety alignment of VLMMs. We have made SafeVid-350K dataset (https://huggingface.co/datasets/yxwang/SafeVid-350K) publicly available.

摘要

随着视频大型多模态模型（VLMMs）的快速发展，其固有复杂性带来了严峻的安全挑战，尤其是静态安全对齐无法迁移到动态视频场景的泛化失配问题。本文提出SafeVid框架，旨在为VLMMs注入视频专属安全原则。该框架通过将文本视频描述作为解释桥梁，将稳健的文本安全对齐能力独特地迁移至视频领域，实现基于LLM的规则驱动安全推理。这一闭环系统包含三个关键环节：1）构建包含35万对视频安全偏好数据的新数据集SafeVid-350K；2）采用直接偏好优化（DPO）对VLMMs进行针对性对齐；3）通过新开发的SafeVidBench基准进行全面评估。实验表明，基于SafeVid-350K的对齐显著提升了VLMMs安全性，以LLaVA-NeXT-Video为代表的模型在SafeVidBench上取得最高42.39%的性能提升。SafeVid不仅提供了关键资源，更通过结构化方法证明：利用文本描述作为安全推理传导媒介，可显著改善VLMMs的安全对齐效果。我们已公开SafeVid-350K数据集（https://huggingface.co/datasets/yxwang/SafeVid-350K）。

AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning

Abstract

arXiv:2505.11896v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities but often face challenges with tasks requiring sophisticated reasoning. While Chain-of-Thought (CoT) prompting significantly enhances reasoning, it indiscriminately generates lengthy reasoning steps for all queries, leading to substantial computational costs and inefficiency, especially for simpler inputs. To address this critical issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework enabling LLMs to adaptively decide when to invoke CoT. AdaCoT frames adaptive reasoning as a Pareto optimization problem that seeks to balance model performance with the costs associated with CoT invocation (both frequency and computational overhead). We propose a reinforcement learning (RL) based method, specifically utilizing Proximal Policy Optimization (PPO), to dynamically control the CoT triggering decision boundary by adjusting penalty coefficients, thereby allowing the model to determine CoT necessity based on implicit query complexity. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training, ensuring robust and stable adaptive triggering. Experimental results demonstrate that AdaCoT successfully navigates the Pareto frontier, achieving substantial reductions in CoT usage for queries not requiring elaborate reasoning. For instance, on our production traffic testset, AdaCoT reduced CoT triggering rates to as low as 3.18% and decreased average response tokens by 69.06%, while maintaining high performance on complex tasks.

摘要

大型语言模型（LLMs）已展现出卓越的能力，但在需要复杂推理的任务中仍面临挑战。尽管思维链（CoT）提示显著增强了推理能力，但它会不加区分地为所有查询生成冗长的推理步骤，导致巨大的计算成本和效率低下，尤其对于较简单的输入。为解决这一关键问题，我们提出了AdaCoT（自适应思维链），这是一种新颖的框架，使LLMs能够自适应地决定何时调用CoT。AdaCoT将自适应推理视为一个帕累托优化问题，旨在平衡模型性能与CoT调用相关的成本（包括频率和计算开销）。我们提出了一种基于强化学习（RL）的方法，特别是利用近端策略优化（PPO），通过调整惩罚系数动态控制CoT触发决策边界，从而使模型能够根据隐式查询复杂度确定CoT的必要性。一个关键的技术贡献是选择性损失掩码（SLM），旨在防止多阶段RL训练期间的决策边界崩溃，确保稳健且稳定的自适应触发。实验结果表明，AdaCoT成功地在帕累托前沿上导航，对于不需要复杂推理的查询，显著减少了CoT的使用。例如，在我们的生产流量测试集上，AdaCoT将CoT触发率降低至3.18%，并将平均响应令牌数减少了69.06%，同时在复杂任务上保持高性能。
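The role of the penalty coefficient in the AdaCoT abstract above can be illustrated with a toy reward. The sketch assumes a single scalar quality score per response and one coefficient `alpha` charged whenever CoT is triggered. Both are simplifications introduced here, not the reward actually used in AdaCoT's PPO training.

```python
def penalized_reward(quality: float, used_cot: bool, alpha: float) -> float:
    """Quality minus a fixed penalty `alpha` whenever CoT is triggered."""
    return quality - (alpha if used_cot else 0.0)


def should_trigger(quality_with_cot: float, quality_without_cot: float, alpha: float) -> bool:
    """CoT pays off only if its quality gain exceeds the penalty."""
    return penalized_reward(quality_with_cot, True, alpha) > penalized_reward(
        quality_without_cot, False, alpha
    )
```

Raising `alpha` moves the decision boundary toward answering directly, and lowering it toward reasoning more often. Sweeping the coefficient is one way to trace out a trade-off curve between quality and CoT usage.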

Fine-Grained ECG-Text Contrastive Learning via Waveform Understanding Enhancement

Abstract

arXiv:2505.11939v1 Announce Type: cross Abstract: Electrocardiograms (ECGs) are essential for diagnosing cardiovascular diseases. While previous ECG-text contrastive learning methods have shown promising results, they often overlook the incompleteness of the reports. Given an ECG, the report is generated by first identifying key waveform features and then inferring the final diagnosis through these features. Despite their importance, these waveform features are often not recorded in the report as intermediate results. Aligning ECGs with such incomplete reports impedes the model's ability to capture the ECG's waveform features and limits its understanding of diagnostic reasoning based on those features. To address this, we propose FG-CLEP (Fine-Grained Contrastive Language ECG Pre-training), which aims to recover these waveform features from incomplete reports with the help of large language models (LLMs), under the challenges of hallucinations and the non-bijective relationship between waveform features and diagnoses. Additionally, considering the frequent false negatives due to the prevalence of common diagnoses in ECGs, we introduce a semantic similarity matrix to guide contrastive learning. Furthermore, we adopt a sigmoid-based loss function to accommodate the multi-label nature of ECG-related tasks. Experiments on six datasets demonstrate that FG-CLEP outperforms state-of-the-art methods in both zero-shot prediction and linear probing across these datasets.

摘要

心电图（ECG）是诊断心血管疾病的重要工具。尽管现有的ECG-文本对比学习方法已取得显著成果，但这些方法往往忽略了报告的不完整性。给定一份心电图，报告的生成通常包含两个步骤：首先识别关键波形特征，随后通过这些特征推导最终诊断结论。然而这些关键波形特征作为中间结果却常未被记录在报告中。将心电图与这类不完整报告直接对齐，会阻碍模型捕捉心电图波形特征的能力，并限制其基于这些特征理解诊断推理的过程。为解决这一问题，我们提出FG-CLEP（细粒度对比语言-心电图预训练）方法，在面临大语言模型幻觉问题及波形特征与诊断结论非双射关系的挑战下，利用大语言模型从不完整报告中还原这些波形特征。针对心电图常见诊断导致的频繁假阴性现象，我们引入语义相似度矩阵来指导对比学习。此外，采用基于Sigmoid的损失函数以适应心电图相关任务的多标签特性。在六个数据集上的实验表明，FG-CLEP在零样本预测和线性探测任务中的表现均优于当前最先进方法。

MARVEL: Multi-Agent RTL Vulnerability Extraction using Large Language Models

Abstract

arXiv:2505.11963v1 Announce Type: cross Abstract: Hardware security verification is a challenging and time-consuming task. For this purpose, design engineers may utilize tools such as formal verification, linters, and functional simulation tests, coupled with analysis and a deep understanding of the hardware design being inspected. Large Language Models (LLMs) have been used to assist during this task, either directly or in conjunction with existing tools. We improve the state of the art by proposing MARVEL, a multi-agent LLM framework for a unified approach to decision-making, tool use, and reasoning. MARVEL mimics the cognitive process of a designer looking for security vulnerabilities in RTL code. It consists of a supervisor agent that devises the security policy of the system-on-chips (SoCs) using its security documentation. It delegates tasks to validate the security policy to individual executor agents. Each executor agent carries out its assigned task using a particular strategy. Each executor agent may use one or more tools to identify potential security bugs in the design and send the results back to the supervisor agent for further analysis and confirmation. MARVEL includes executor agents that leverage formal tools, linters, simulation tests, LLM-based detection schemes, and static analysis-based checks. We test our approach on a known buggy SoC based on OpenTitan from the Hack@DATE competition. We find that 20 of the 48 issues reported by MARVEL pose security vulnerabilities.

摘要

硬件安全验证是一项具有挑战性且耗时的任务。为此，设计工程师可采用形式化验证、静态检查工具和功能仿真测试等方法，并结合对被测硬件设计的深入分析。大型语言模型（LLMs）已在此过程中被直接或联合现有工具用于辅助工作。我们通过提出MARVEL框架改进了现有技术，该多智能体LLM框架为决策制定、工具使用和推理提供了统一方法。MARVEL模拟了设计者在RTL代码中寻找安全漏洞的认知过程：由监督智能体根据芯片系统（SoCs）的安全文档制定安全策略，并将安全策略验证任务分配给执行智能体。每个执行智能体采用特定策略完成任务，可运用形式化工具、静态检查、仿真测试、基于LLM的检测方案或静态分析检查等一种或多种工具识别设计中的潜在安全缺陷，并将结果反馈给监督智能体进行进一步分析与确认。我们在基于Hack@DATE竞赛中OpenTitan的已知缺陷SoC上进行测试，发现MARVEL报告的48个问题中有20个确实存在安全漏洞。

Exploring Criteria of Loss Reweighting to Enhance LLM Unlearning

Abstract

arXiv:2505.11953v1 Announce Type: cross Abstract: Loss reweighting has shown significant benefits for machine unlearning with large language models (LLMs). However, its exact function remains unclear and the optimal strategy is still an open question, impeding the understanding and improvement of existing methodologies. In this paper, we identify two distinct goals of loss reweighting, namely, Saturation and Importance -- the former indicates that insufficiently optimized data should be emphasized, while the latter stresses the critical data that are most influential for loss minimization. To study their usefulness, we design specific reweighting strategies for each goal and evaluate their respective effects on unlearning. We conduct extensive empirical analyses on well-established benchmarks, and summarize some important observations as follows: (i) Saturation enhances efficacy more than importance-based reweighting, and their combination can yield additional improvements. (ii) Saturation typically allocates lower weights to data with lower likelihoods, whereas importance-based reweighting does the opposite. (iii) The efficacy of unlearning is also largely influenced by the smoothness and granularity of the weight distributions. Based on these findings, we propose SatImp, a simple reweighting method that combines the advantages of both saturation and importance. Empirical results on extensive datasets validate the efficacy of our method, potentially bridging existing research gaps and indicating directions for future research. Our code is available at https://github.com/Puning97/SatImp-for-LLM-Unlearning.

摘要

损失函数重加权技术在大语言模型（LLM）的机器遗忘任务中展现出显著优势。然而，其具体作用机制尚不明确，最优策略仍存在争议，这阻碍了对现有方法的理解与改进。本文提出损失重加权的两个核心目标——饱和性与重要性：前者强调应优先优化未充分训练的数据，后者则关注对损失最小化最具影响力的关键数据。为验证其效用，我们针对每个目标设计了特定重加权策略，并评估其在遗忘任务中的效果。基于成熟基准测试的广泛实证分析，我们得出以下重要结论：（1）基于饱和性的重加权比重要性策略更能提升遗忘效能，二者结合可产生额外增益；（2）饱和性策略通常对低似然数据分配较低权重，而重要性策略则相反；（3）遗忘效果还显著受权重分布的平滑性与粒度影响。基于这些发现，我们提出SatImp方法——一种融合饱和性与重要性优势的简单重加权算法。多数据集实验验证了该方法的有效性，不仅弥合了现有研究缺口，也为未来研究方向提供了启示。代码已开源：https://github.com/Puning97/SatImp-for-LLM-Unlearning。

Personalized Author Obfuscation with Large Language Models

Abstract

arXiv:2505.12090v1 Announce Type: cross Abstract: In this paper, we investigate the efficacy of large language models (LLMs) in obfuscating authorship by paraphrasing and altering writing styles. Rather than adopting a holistic approach that evaluates performance across the entire dataset, we focus on user-wise performance to analyze how obfuscation effectiveness varies across individual authors. While LLMs are generally effective, we observe a bimodal distribution of efficacy, with performance varying significantly across users. To address this, we propose a personalized prompting method that outperforms standard prompting techniques and partially mitigates the bimodality issue.

摘要

本文研究了大型语言模型（LLMs）通过文本复述和写作风格转换来实现作者身份混淆的有效性。与采用整体评估方法不同，我们聚焦于用户层面的性能表现，以分析不同个体作者间的混淆效果差异。尽管LLMs总体上表现良好，但我们观察到其效能呈现双峰分布，不同用户间的性能差异显著。为此，我们提出了一种个性化提示方法，该方法优于标准提示技术，并部分缓解了双峰分布问题。

ABoN: Adaptive Best-of-N Alignment

Abstract

arXiv:2505.12050v1 Announce Type: cross Abstract: Recent advances in test-time alignment methods, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RM). However, these approaches can be computationally expensive, especially when applied uniformly across prompts without accounting for differences in alignment difficulty. In this work, we propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Motivated by latency concerns, we develop a two-stage algorithm: an initial exploratory phase estimates the reward distribution for each prompt using a small exploration budget, and a second stage adaptively allocates the remaining budget using these estimates. Our method is simple, practical, and compatible with any LM/RM combination. Empirical results on the AlpacaEval dataset for 12 LM/RM pairs and 50 different batches of prompts show that our adaptive strategy consistently outperforms the uniform allocation with the same inference budget. Moreover, our experiments show that our adaptive strategy remains competitive against uniform allocations with 20% larger inference budgets and even improves in performance as the batch size grows.

摘要

测试时对齐方法（如最佳N采样法）的最新进展，提供了一种利用奖励模型（RM）引导语言模型（LM）实现预期行为的简单有效方案。然而，这些方法计算成本较高，特别是在未考虑不同提示间对齐难度差异、统一应用的情况下。本研究提出一种面向最佳N对齐的自适应提示策略，可更高效地分配推理计算资源。基于延迟考量，我们开发了一个两阶段算法：初始探索阶段使用少量探索预算估计每个提示的奖励分布，第二阶段则根据这些估计值自适应分配剩余预算。该方法简单实用，兼容任意LM/RM组合。在AlpacaEval数据集上对12组LM/RM配对和50批不同提示的实证结果表明，在相同推理预算下，本自适应策略始终优于均匀分配方案。此外，实验显示本策略在推理预算增加20%的情况下仍与均匀分配方案保持竞争力，且随着批量增大性能进一步提升。
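The second stage of the two-stage scheme above can be sketched as a budget-allocation step. Everything specific in this example is an assumption: it takes the exploratory reward samples as given, uses their standard deviation as a proxy for how much a prompt benefits from extra samples, and splits the remaining budget proportionally. The paper's actual allocation rule may differ.

```python
from statistics import pstdev


def allocate_remaining(
    explore_rewards: dict[str, list[float]], remaining: int
) -> dict[str, int]:
    """Split `remaining` samples across prompts in proportion to reward spread."""
    spreads = {prompt: pstdev(r) for prompt, r in explore_rewards.items()}
    total = sum(spreads.values())
    if total == 0:
        # No spread observed anywhere: fall back to a uniform split.
        weights = {prompt: 1.0 / len(spreads) for prompt in spreads}
    else:
        weights = {prompt: s / total for prompt, s in spreads.items()}
    raw = {prompt: w * remaining for prompt, w in weights.items()}
    alloc = {prompt: int(x) for prompt, x in raw.items()}
    # Hand out samples lost to rounding, largest fractional part first.
    leftover = remaining - sum(alloc.values())
    by_fraction = sorted(raw, key=lambda p: raw[p] - alloc[p], reverse=True)
    for prompt in by_fraction[:leftover]:
        alloc[prompt] += 1
    return alloc
```

Prompts whose exploratory rewards barely vary receive few extra samples, since a larger N is unlikely to change their best response much. High-variance prompts receive most of the remaining budget.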

Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets

Abstract

arXiv:2505.12038v1 Announce Type: cross Abstract: Large language models (LLMs) have shown great potential as general-purpose AI assistants across various domains. To fully leverage this potential in specific applications, many companies provide fine-tuning API services, enabling users to upload their own data for LLM customization. However, fine-tuning services introduce a new safety threat: user-uploaded data, whether harmful or benign, can break the model's alignment, leading to unsafe outputs. Moreover, existing defense methods struggle to address the diversity of fine-tuning datasets (e.g., varying sizes, tasks), often sacrificing utility for safety or vice versa. To address this issue, we propose Safe Delta, a safety-aware post-training defense method that adjusts the delta parameters (i.e., the parameter change before and after fine-tuning). Specifically, Safe Delta estimates the safety degradation, selects delta parameters to maximize utility while limiting overall safety loss, and applies a safety compensation vector to mitigate residual safety loss. Through extensive experiments on four diverse datasets with varying settings, our approach consistently preserves safety while ensuring that the utility gain from benign datasets remains unaffected.

摘要

大型语言模型（LLMs）作为通用人工智能助手在多个领域展现出巨大潜力。为在特定应用中充分发挥这一潜力，许多公司提供微调API服务，允许用户上传自有数据对LLM进行定制。然而，微调服务引入了新的安全威胁：用户上传的数据无论有害或良性，均可能破坏模型的对齐性，导致不安全输出。现有防御方法难以应对微调数据集的多样性（如规模、任务差异），往往需要牺牲实用性换取安全性，或反之。针对该问题，我们提出Safe Delta——一种安全感知的训练后防御方法，通过调整增量参数（即微调前后的参数变化量）实现防护。具体而言，Safe Delta评估安全退化程度，选择增量参数以在限制总体安全损失的同时最大化实用性，并应用安全补偿向量来消除残余安全损失。通过在四种不同设置的数据集上进行广泛实验，本方法在保持良性数据集带来的实用性增益不受影响的同时，始终确保安全性。

Attribution Projection Calculus: A Novel Framework for Causal Inference in Bayesian Networks

Abstract

arXiv:2505.12094v1 Announce Type: cross Abstract: This paper introduces Attribution Projection Calculus (AP-Calculus), a novel mathematical framework for determining causal relationships in structured Bayesian networks. We investigate a specific network architecture with source nodes connected to destination nodes through intermediate nodes, where each input maps to a single label with maximum marginal probability. We prove that for each label, exactly one intermediate node acts as a deconfounder while others serve as confounders, enabling optimal attribution of features to their corresponding labels. The framework formalizes the dual nature of intermediate nodes as both confounders and deconfounders depending on the context, and establishes separation functions that maximize distinctions between intermediate representations. We demonstrate that the proposed network architecture is optimal for causal inference compared to alternative structures, including those based on Pearl's causal framework. AP-Calculus provides a comprehensive mathematical foundation for analyzing feature-label attributions, managing spurious correlations, quantifying information gain, ensuring fairness, and evaluating uncertainty in prediction models, including large language models. Theoretical verification shows that AP-Calculus not only extends but can also subsume traditional do-calculus for many practical applications, offering a more direct approach to causal inference in supervised learning contexts.

摘要

本文提出归因投影演算（AP-Calculus）这一新型数学框架，用于确定结构化贝叶斯网络中的因果关系。我们研究了一种特定网络架构：源节点通过中间节点连接至目标节点，其中每个输入通过最大边际概率映射至单一标签。我们证明对于每个标签，恰有一个中间节点扮演去混杂因子角色，其余则作为混杂因子，从而实现特征到对应标签的最优归因。该框架形式化中间节点根据上下文兼具混杂因子与去混杂因子的双重性质，并建立了能最大化中间表征差异的分离函数。通过对比基于Pearl因果框架等替代结构，我们论证所提网络架构在因果推断中的最优性。AP-Calculus为分析特征-标签归因、管理伪相关、量化信息增益、确保公平性及评估预测模型（包括大语言模型）不确定性提供了完整数学基础。理论验证表明AP-Calculus不仅能扩展传统do-演算，在许多实际应用中还可将其纳入，为监督学习场景下的因果推断提供了更直接的途径。

Decoding the Mind of Large Language Models: A Quantitative Evaluation of Ideology and Biases

Abstract

arXiv:2505.12183v1 Announce Type: cross Abstract: The widespread integration of Large Language Models (LLMs) across various sectors has highlighted the need for empirical research to understand their biases, thought patterns, and societal implications to ensure ethical and effective use. In this study, we propose a novel framework for evaluating LLMs, focusing on uncovering their ideological biases through a quantitative analysis of 436 binary-choice questions, many of which have no definitive answer. Applying our framework to ChatGPT and Gemini, we find that while LLMs generally maintain consistent opinions on many topics, their ideologies differ across models and languages. Notably, ChatGPT exhibits a tendency to change its opinion to match the questioner's. Both models also exhibit problematic biases and make unethical or unfair claims, which might have negative societal impacts. These results underscore the importance of addressing both ideological and ethical considerations when evaluating LLMs. The proposed framework offers a flexible, quantitative method for assessing LLM behavior, providing valuable insights for the development of more socially aligned AI systems.

摘要

大型语言模型（LLMs）在各行业的广泛应用凸显了实证研究的必要性，以理解其偏见、思维模式及社会影响，从而确保伦理且有效的使用。本研究提出了一种评估LLMs的新框架，重点通过对436道二元选择题（其中许多并无明确答案）进行定量分析，揭示其意识形态偏见。将该框架应用于ChatGPT和Gemini后发现，尽管LLMs在多数议题上保持观点一致，但其意识形态在不同模型和语言间存在差异。值得注意的是，ChatGPT倾向于改变观点以迎合提问者的立场。两种模型均表现出存在问题的偏见及不道德或不公平的论断，可能对社会产生负面影响。这些结果强调了在评估LLMs时兼顾意识形态与伦理考量的重要性。所提出的框架为评估LLM行为提供了一种灵活的定量方法，为开发更符合社会需求的AI系统提供了重要见解。

Improving Fairness in LLMs Through Testing-Time Adversaries

Abstract

arXiv:2505.12100v1 Announce Type: cross Abstract: Large Language Models (LLMs) push the boundaries in natural language processing and generative AI, driving progress across various aspects of modern society. Unfortunately, the pervasive issue of bias in LLM responses (i.e., predictions) poses a significant and open challenge, hindering their application in tasks involving ethical sensitivity and responsible decision-making. In this work, we propose a straightforward, user-friendly and practical method to mitigate such biases, enhancing the reliability and trustworthiness of LLMs. Our method creates multiple variations of a given sentence by modifying specific attributes and evaluates the corresponding prediction behavior compared to the original, unaltered prediction/sentence. The idea behind this process is that critical ethical predictions often exhibit notable inconsistencies, indicating the presence of bias. Unlike previous approaches, our method relies solely on forward passes (i.e., testing-time adversaries), eliminating the need for training, fine-tuning, or prior knowledge of the training data distribution. Through extensive experiments on the popular Llama family, we demonstrate the effectiveness of our method in improving various fairness metrics, focusing on the reduction of disparities in how the model treats individuals from different racial groups. Specifically, using standard metrics, we improve the fairness of Llama3 by up to 27 percentage points. Overall, our approach significantly enhances fairness, equity, and reliability in LLM-generated results without parameter tuning or training data modifications, confirming its effectiveness in practical scenarios. We believe our work establishes an important step toward enabling the use of LLMs in tasks that require ethical considerations and responsible decision-making.

摘要

大语言模型（LLMs）不断突破自然语言处理和生成式人工智能的边界，推动现代社会的多方面进步。然而，LLMs响应（即预测）中普遍存在的偏见问题构成了一个重大且开放的挑战，阻碍了其在涉及伦理敏感性和负责任决策任务中的应用。本研究提出了一种简单、用户友好且实用的方法来缓解此类偏见，从而提升LLMs的可靠性和可信度。我们的方法通过修改特定属性生成给定句子的多个变体，并评估其与原始未修改预测/句子相比的对应预测行为。这一过程背后的理念是，关键的伦理预测往往表现出显著的不一致性，表明存在偏见。与以往方法不同，我们的方法仅依赖于前向传递（即测试时对抗），无需训练、微调或对训练数据分布的先验知识。通过对流行的Llama系列模型进行大量实验，我们证明了该方法在改善各种公平性指标方面的有效性，重点关注减少模型对不同种族群体个体处理差异的问题。具体而言，使用标准指标，我们将Llama3的公平性最高提升了27个百分点。总体而言，我们的方法在不调整参数或修改训练数据的情况下，显著提高了LLM生成结果的公平性、公正性和可靠性，证实了其在实际场景中的有效性。我们相信，这项工作为在需要伦理考量和负责任决策的任务中使用LLMs奠定了重要一步。
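
The variation-and-compare idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `make_variants`, `check_consistency`, and the `predict` callable are hypothetical names, the attribute swap is a plain string replacement, and the majority vote is one possible way to summarize the variants (the abstract does not say how inconsistencies are resolved).

```python
from collections import Counter
from typing import Callable, Dict, List


def make_variants(sentence: str, attribute: str, alternatives: List[str]) -> List[str]:
    """Return copies of `sentence` with `attribute` swapped for each alternative value."""
    return [sentence.replace(attribute, alt) for alt in alternatives if alt != attribute]


def check_consistency(
    sentence: str,
    attribute: str,
    alternatives: List[str],
    predict: Callable[[str], str],
) -> Dict[str, object]:
    """Compare the prediction on the original sentence with predictions on its variants.

    Only forward passes through `predict` are needed; nothing is trained.
    """
    original = predict(sentence)
    variant_preds = [predict(v) for v in make_variants(sentence, attribute, alternatives)]
    # Assumption: summarize all predictions with a simple majority vote.
    majority, _ = Counter(variant_preds + [original]).most_common(1)[0]
    return {
        "original": original,
        "consistent": all(p == original for p in variant_preds),
        "majority": majority,
    }
```

In use, `predict` would wrap a call to the model under test; a prediction that flips when only the sensitive attribute changes is flagged through `consistent == False`.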

Reasoning Large Language Model Errors Arise from Hallucinating Critical Problem Features

Abstract

arXiv:2505.12151v1 Announce Type: cross Abstract: Large language models have recently made great strides in reasoning task performance through chain-of-thought (CoT) strategies trained via reinforcement learning; however, these "reasoning large language models" (RLLMs) remain imperfect reasoners, and understanding the frequencies and causes of their failure modes is important for both users and developers. We test o1-mini, o3-mini, DeepSeek-R1, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, and Grok 3 Mini Beta on graph coloring as a variable-complexity constraint-satisfaction logic problem, and find evidence from both error rate comparisons and CoT/explanation text analysis that RLLMs are prone to hallucinate edges not specified in the prompt's description of the graph. This phenomenon persists across multiple problem complexity levels and semantic frames, and it appears to account for a significant fraction of the incorrect answers from every tested model, and the vast majority of them for some models. Our results indicate that RLLMs may possess broader issues with misrepresentation of problem specifics, and we offer suggestions for design choices to mitigate this weakness.

摘要

大型语言模型近期通过基于强化学习的思维链（CoT）策略在推理任务表现上取得显著进展，然而这些"推理型大语言模型"（RLLMs）仍存在缺陷。理解其故障模式的频率和成因对用户和开发者均具有重要意义。本研究以图着色这一可变复杂度的约束满足逻辑问题为测试基准，对o1-mini、o3-mini、DeepSeek-R1、Claude 3.7 Sonnet、Gemini 2.5 Pro Preview及Grok 3 Mini Beta等模型进行实验。通过错误率比较和思维链/解释文本分析发现，RLLMs普遍存在幻觉现象——即会虚构提示文本中未指定的图边关系。该现象在不同问题复杂度层级和语义框架下持续存在，在所有测试模型的错误答案中均占显著比例，某些模型甚至呈现压倒性多数。研究结果表明RLLMs可能存在更广泛的"问题细节误表征"现象，我们针对该缺陷提出了若干设计改进建议。
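
To make the failure mode concrete, the following minimal sketch checks a proposed coloring against the edges given in a prompt and lists edges that a model cites but that were never specified. The function names and the edge-list representation are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def normalize(edges: Iterable[Edge]) -> Set[Edge]:
    """Treat edges as undirected by storing each one as (smaller, larger)."""
    return {(min(u, v), max(u, v)) for u, v in edges}


def coloring_violations(edges: Iterable[Edge], coloring: Dict[int, str]) -> List[Edge]:
    """Return the edges whose two endpoints received the same color."""
    return sorted(e for e in normalize(edges) if coloring[e[0]] == coloring[e[1]])


def hallucinated_edges(prompt_edges: Iterable[Edge], cited_edges: Iterable[Edge]) -> List[Edge]:
    """Return edges cited in a model's reasoning that do not appear in the prompt."""
    return sorted(normalize(cited_edges) - normalize(prompt_edges))
```

A model that rejects a valid coloring because of an edge returned by `hallucinated_edges` would be an instance of the error described in the abstract.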

Self-Destructive Language Model

Abstract

arXiv:2505.12186v1 Announce Type: cross Abstract: Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models' inherent "trainability" on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this critical limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. (warning: this paper contains potentially harmful content generated by LLMs.)

摘要

有害微调攻击对大型语言模型（LLMs）的安全性构成重大威胁，攻击者仅需少量有害数据即可突破安全防护机制。现有防御方案虽试图强化LLM的对齐性，却未能解决模型对有害数据固有的"可训练性"问题，导致其在更高学习率或更大规模有害数据集攻击下依然脆弱。为突破这一关键局限，我们提出SEAM——一种创新的对齐增强防御方法，通过将LLM转化为具有内在抗错配能力的自毁模型来实现防护。这类模型在保持正常任务性能的同时，会在有害数据微调时出现显著性能退化。该保护机制通过新型损失函数实现，该函数耦合了良性数据与有害数据的优化轨迹，并采用对抗性梯度上升来增强自毁效应。为实现高效训练，我们开发了具有理论误差界的高效无Hessian梯度估计方法。跨LLM和数据集的广泛评估表明，SEAM使攻击者陷入无解困境：自毁模型在低强度攻击下展现最先进的鲁棒性，而在高强度攻击下会发生灾难性性能崩溃，使其完全失效。

Reward Inside the Model: A Lightweight Hidden-State Reward Model for LLM's Best-of-N sampling

Abstract

arXiv:2505.12225v1 Announce Type: cross Abstract: High-quality reward models are crucial for unlocking the reasoning potential of large language models (LLMs), with best-of-N voting demonstrating significant performance gains. However, current reward models, which typically operate on the textual output of LLMs, are computationally expensive and parameter-heavy, limiting their real-world applications. We introduce the Efficient Linear Hidden State Reward (ELHSR) model - a novel, highly parameter-efficient approach that leverages the rich information embedded in LLM hidden states to address these issues. ELHSR systematically outperforms baselines with less than 0.005% of their parameters, requiring only a few samples for training. ELHSR also achieves orders-of-magnitude efficiency improvement with significantly less time and fewer FLOPs per sample than baseline reward models. Moreover, ELHSR exhibits robust performance even when trained only on logits, extending its applicability to some closed-source LLMs. In addition, ELHSR can also be combined with traditional reward models to achieve additional performance gains.

摘要

高质量的奖励模型对于释放大语言模型（LLMs）的推理潜能至关重要，其中最佳N选投票机制已展现出显著的性能提升。然而，当前基于LLM文本输出的奖励模型通常计算成本高昂且参数量庞大，限制了其实际应用。本文提出高效线性隐藏状态奖励（ELHSR）模型——一种参数效率极高的创新方法，通过利用LLM隐藏状态中嵌入的丰富信息来解决上述问题。ELHSR系统性地以低于基线模型0.005%的参数量超越基线性能，且仅需少量训练样本。与基线奖励模型相比，ELHSR实现了数量级的效率提升，单个样本所需计算时间和FLOPs显著减少。值得注意的是，ELHSR仅依靠logits进行训练时仍保持稳健性能，这扩展了其在某些闭源LLM中的适用性。此外，ELHSR还可与传统奖励模型结合以获取额外性能增益。
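
As a rough illustration of scoring candidates from hidden states and picking the best of N, here is a minimal pure-Python sketch. The abstract does not spell out ELHSR's exact architecture, so the mean pooling over tokens, the single weight vector, and the names `mean_pool`, `linear_reward`, and `best_of_n` are all assumptions made for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(hidden_states: Sequence[Vector]) -> List[float]:
    """Average per-token hidden states into one vector (an assumed pooling choice)."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n for d in range(dim)]


def linear_reward(hidden_states: Sequence[Vector], weights: Vector, bias: float = 0.0) -> float:
    """Score one candidate with a single linear layer over its pooled hidden state."""
    pooled = mean_pool(hidden_states)
    return sum(w * x for w, x in zip(weights, pooled)) + bias


def best_of_n(candidates: Sequence[Sequence[Vector]], weights: Vector, bias: float = 0.0) -> int:
    """Return the index of the candidate with the highest linear reward."""
    scores = [linear_reward(h, weights, bias) for h in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

The parameter count of such a scorer is just the hidden dimension plus one, which conveys why a hidden-state reward head can be far smaller than a text-level reward model.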

LLM-DSE: Searching Accelerator Parameters with LLM Agents

Abstract

arXiv:2505.12188v1 Announce Type: cross Abstract: Even though high-level synthesis (HLS) tools mitigate the challenges of programming domain-specific accelerators (DSAs) by raising the abstraction level, optimizing hardware directive parameters remains a significant hurdle. Existing heuristic and learning-based methods struggle with adaptability and sample efficiency. We present LLM-DSE, a multi-agent framework designed specifically for optimizing HLS directives. Combining LLM with design space exploration (DSE), our explorer coordinates four agents: Router, Specialists, Arbitrator, and Critic. These multi-agent components interact with various tools to accelerate the optimization process. LLM-DSE leverages essential domain knowledge to identify efficient parameter combinations while maintaining adaptability through verbal learning from online interactions. Evaluations on the HLSyn dataset demonstrate that LLM-DSE achieves substantial $2.55\times$ performance gains over state-of-the-art methods, uncovering novel designs while reducing runtime. Ablation studies validate the effectiveness and necessity of the proposed agent interactions. Our code is open-sourced here: https://github.com/Nozidoali/LLM-DSE.

摘要

尽管高层次综合（HLS）工具通过提升抽象层级缓解了领域专用加速器（DSA）的编程挑战，但硬件指令参数的优化仍是一个重大难题。现有启发式与基于学习的方法在适应性和样本效率方面存在局限。本文提出LLM-DSE——一个专为HLS指令优化设计的多智能体框架。该方法将大语言模型（LLM）与设计空间探索（DSE）相结合，由探索器协调四个智能体：路由器（Router）、专家组（Specialists）、仲裁器（Arbitrator）和评估器（Critic）。这些多智能体组件通过与各类工具交互加速优化过程。LLM-DSE利用关键领域知识识别高效参数组合，同时通过在线交互的语词学习保持适应性。在HLSyn数据集上的评估表明，LLM-DSE以2.55倍的性能提升显著超越现有最优方法，在降低运行时的同时发现新颖设计。消融实验验证了所提出智能体交互机制的有效性与必要性。代码已开源：https://github.com/Nozidoali/LLM-DSE。

Bridging Generative and Discriminative Learning: Few-Shot Relation Extraction via Two-Stage Knowledge-Guided Pre-training

Abstract

arXiv:2505.12236v1 Announce Type: cross Abstract: Few-Shot Relation Extraction (FSRE) remains a challenging task due to the scarcity of annotated data and the limited generalization capabilities of existing models. Although large language models (LLMs) have demonstrated potential in FSRE through in-context learning (ICL), their general-purpose training objectives often result in suboptimal performance for task-specific relation extraction. To overcome these challenges, we propose TKRE (Two-Stage Knowledge-Guided Pre-training for Relation Extraction), a novel framework that synergistically integrates LLMs with traditional relation extraction models, bridging generative and discriminative learning paradigms. TKRE introduces two key innovations: (1) leveraging LLMs to generate explanation-driven knowledge and schema-constrained synthetic data, addressing the issue of data scarcity; and (2) a two-stage pre-training strategy combining Masked Span Language Modeling (MSLM) and Span-Level Contrastive Learning (SCL) to enhance relational reasoning and generalization. Together, these components enable TKRE to effectively tackle FSRE tasks. Comprehensive experiments on benchmark datasets demonstrate the efficacy of TKRE, achieving new state-of-the-art performance in FSRE and underscoring its potential for broader application in low-resource scenarios. The code and data are released at https://github.com/UESTC-GQJ/TKRE.

摘要

小样本关系抽取（FSRE）由于标注数据稀缺和现有模型泛化能力有限，仍是一项具有挑战性的任务。尽管大语言模型（LLMs）通过上下文学习（ICL）在FSRE中展现出潜力，但其通用训练目标往往导致任务特定关系抽取的性能欠佳。为克服这些挑战，我们提出TKRE（面向关系抽取的两阶段知识引导预训练框架），该创新框架通过融合生成式与判别式学习范式，将LLMs与传统关系抽取模型协同整合。TKRE包含两项关键创新：（1）利用LLMs生成解释驱动知识和模式约束的合成数据，解决数据稀缺问题；（2）采用掩码跨度语言建模（MSLM）与跨度级对比学习（SCL）相结合的两阶段预训练策略，以增强关系推理和泛化能力。这些组件共同使TKRE能有效处理FSRE任务。在基准数据集上的全面实验验证了TKRE的效能，其创造了FSRE领域的最先进性能，并凸显了在低资源场景中更广泛应用的潜力。代码与数据已发布于https://github.com/UESTC-GQJ/TKRE。

Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind

Abstract

arXiv:2505.12207v1 Announce Type: cross Abstract: Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily in terms of insufficient scene diversity in the dataset and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 25,026 QA pairs and 15,556 images. The pipeline begins with multi-source data preprocessing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we employ LMMs for inference, generating responses, and performing detailed examinations. We evaluated 18 open-source LMMs and 3 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition. Notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at https://rssysu.github.io/AgroMind/.

摘要

大型多模态模型（LMMs）已在多个领域展现出强大能力，但针对农业遥感（RS）的综合基准测试仍较为匮乏。现有农业遥感场景的基准测试存在明显局限性，主要体现在数据集场景多样性不足以及任务设计过于简化。为填补这一空白，我们提出了AgroMind——一个涵盖空间感知、对象理解、场景理解和场景推理四大任务维度的综合性农业遥感基准测试，包含从作物识别、健康监测到环境分析等共13种任务类型。我们通过整合8个公共数据集和1个私有农田地块数据集，构建了包含25,026个问答对和15,556张图像的高质量评估集。流程始于多源数据预处理，包括数据收集、格式标准化和标注优化；随后通过系统化任务定义生成多样化的农业相关问题；最后利用LMMs进行推理、生成响应并开展细致评估。我们在AgroMind上测试了18个开源LMMs和3个闭源模型，实验结果表明模型性能存在显著差距，尤其在空间推理和细粒度识别方面。值得注意的是，人类表现甚至落后于多个领先的LMMs。通过建立农业遥感标准化评估框架，AgroMind揭示了LMMs在领域知识方面的局限性，并指明了未来研究的关键挑战。数据与代码详见https://rssysu.github.io/AgroMind/。

LightRetriever: A LLM-based Hybrid Retrieval Architecture with 1000x Faster Query Inference

Abstract

arXiv:2505.12260v1 Announce Type: cross Abstract: Large Language Model (LLM)-based hybrid retrieval uses LLMs to encode queries and documents into low-dimensional dense or high-dimensional sparse vectors. It retrieves documents relevant to search queries based on vector similarities. Documents are pre-encoded offline, while queries arrive in real-time, necessitating an efficient online query encoder. Although LLMs significantly enhance retrieval capabilities, serving deeply parameterized LLMs slows down query inference throughput and increases demands for online deployment resources. In this paper, we propose LightRetriever, a novel LLM-based hybrid retriever with extremely lightweight query encoders. Our method retains a full-sized LLM for document encoding, but reduces the workload of query encoding to no more than an embedding lookup. Compared to serving a full-sized LLM on an H800 GPU, our approach achieves over a 1000x speedup for query inference with GPU acceleration, and even a 20x speedup without GPU. Experiments on large-scale retrieval benchmarks demonstrate that our method generalizes well across diverse retrieval tasks, retaining an average of 95% full-sized performance.

摘要

基于大语言模型（LLM）的混合检索系统利用LLM将查询和文档编码为低维稠密或高维稀疏向量，通过向量相似度检索与搜索查询相关的文档。文档可预先离线编码，而查询需实时处理，因此需要高效的在线查询编码器。尽管LLM显著提升了检索能力，但深度参数化的LLM服务会降低查询推理吞吐量，并增加在线部署资源需求。本文提出LightRetriever——一种基于LLM的新型混合检索器，其查询编码器极其轻量。该方法保留完整规模的LLM用于文档编码，但将查询编码的工作量降至不超过一次嵌入查找的水平。与在H800 GPU上运行完整规模LLM相比，我们的方法在GPU加速下可实现超过1000倍的查询推理加速，无GPU时仍能实现20倍加速。大规模检索基准测试表明，该方法在多样化检索任务中泛化性能良好，平均保留95%的完整规模模型性能。
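
The asymmetry described above (heavy offline document encoding, near-free online query encoding) can be illustrated with a small sketch. Everything here is assumed for the example: the precomputed `table` of token vectors, averaging as the way to combine looked-up vectors, and dot-product ranking. It is not LightRetriever's actual encoder.

```python
from typing import Dict, List, Sequence


def encode_query(tokens: Sequence[str], table: Dict[str, List[float]]) -> List[float]:
    """Encode a query by looking up precomputed token vectors and averaging them.

    No model forward pass happens at query time; unknown tokens are skipped.
    """
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        raise ValueError("no query token found in the embedding table")
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


def rank_documents(query_vec: Sequence[float], doc_vecs: Dict[str, List[float]]) -> List[str]:
    """Rank document ids by dot-product similarity to the query vector."""
    def score(vec: Sequence[float]) -> float:
        return sum(q * x for q, x in zip(query_vec, vec))

    return sorted(doc_vecs, key=lambda name: score(doc_vecs[name]), reverse=True)
```

In this picture `doc_vecs` would be produced offline by the full-sized model, while `encode_query` is the only work done per incoming query.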

Not All Documents Are What You Need for Extracting Instruction Tuning Data

Abstract

arXiv:2505.12250v1 Announce Type: cross Abstract: Instruction tuning improves the performance of large language models (LLMs), but it heavily relies on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data extraction framework that iteratively alternates between document selection and high-quality QA pair extraction to enhance instruction tuning. EQUAL first clusters the document corpus based on embeddings derived from contrastive learning, then uses a multi-armed bandit strategy to efficiently identify clusters that are likely to contain valuable QA pairs. This iterative approach significantly reduces computational cost while boosting model performance. Experiments on AutoMathText and StackOverflow across four downstream tasks show that EQUAL reduces computational costs by 5-10x and improves accuracy by 2.5 percent on LLaMA-3.1-8B and Mistral-7B.

摘要

指令微调能提升大语言模型（LLM）的性能，但其效果高度依赖高质量训练数据。近期研究利用LLM基于种子问答对（QA）合成指令数据，但这类合成指令往往缺乏多样性且与输入种子高度相似，限制了实际应用效果。为此，我们提出从蕴含丰富多样知识的网络语料库中提取指令微调数据。原始方案是检索领域相关文档并提取全部QA对，但面临两大挑战：（1）使用LLM提取全部QA对成本极高；（2）大量提取的QA对可能与下游任务无关，反而损害模型性能。为解决这些问题，我们提出EQUAL框架——通过迭代交替执行文档筛选与高质量QA对提取来实现高效可扩展的数据提取。EQUAL首先基于对比学习得到的嵌入向量对文档聚类，再采用多臂老虎机策略高效识别可能包含有价值QA对的聚类簇。这种迭代方法在显著降低计算成本的同时提升了模型性能。在AutoMathText和StackOverflow数据集上进行的四项下游任务实验表明，EQUAL将计算成本降低5-10倍，并使LLaMA-3.1-8B和Mistral-7B模型的准确率提升2.5%。
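
The abstract only says that a multi-armed bandit strategy picks promising clusters; it does not name the algorithm. The sketch below swaps in the standard UCB1 rule as a stand-in and treats each document cluster as an arm. The `cluster_quality` list is a hypothetical, deterministic substitute for the reward that would really come from extracting and evaluating QA pairs.

```python
import math
from typing import List


def ucb1_select(counts: List[int], totals: List[float], step: int) -> int:
    """Pick the next arm: try every arm once, then maximize mean + exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: totals[a] / counts[a] + math.sqrt(2.0 * math.log(step) / counts[a]),
    )


def run_bandit(cluster_quality: List[float], rounds: int) -> List[int]:
    """Simulate cluster selection and return how often each cluster was sampled."""
    counts = [0] * len(cluster_quality)
    totals = [0.0] * len(cluster_quality)
    for step in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, step)
        reward = cluster_quality[arm]  # stand-in for measured QA-pair usefulness
        counts[arm] += 1
        totals[arm] += reward
    return counts
```

The point of the design is that low-value clusters are sampled only often enough to confirm they are low-value, so most of the extraction budget goes to the clusters that pay off.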

PANORAMA: A synthetic PII-laced dataset for studying sensitive data memorization in LLMs

Abstract

arXiv:2505.12238v1 Announce Type: cross Abstract: The memorization of sensitive and personally identifiable information (PII) by large language models (LLMs) poses growing privacy risks as models scale and are increasingly deployed in real-world applications. Existing efforts to study sensitive and PII data memorization and develop mitigation strategies are hampered by the absence of comprehensive, realistic, and ethically sourced datasets reflecting the diversity of sensitive information found on the web. We introduce PANORAMA - Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis, a large-scale synthetic corpus of 384,789 samples derived from 9,674 synthetic profiles designed to closely emulate the distribution, variety, and context of PII and sensitive data as it naturally occurs in online environments. Our data generation pipeline begins with the construction of internally consistent, multi-attribute human profiles using constrained selection to reflect real-world demographics such as education, health attributes, financial status, etc. Using a combination of zero-shot prompting and OpenAI o3-mini, we generate diverse content types - including wiki-style articles, social media posts, forum discussions, online reviews, comments, and marketplace listings - each embedding realistic, contextually appropriate PII and other sensitive information. We validate the utility of PANORAMA by fine-tuning the Mistral-7B model on 1x, 5x, 10x, and 25x data replication rates with a subset of data and measure PII memorization rates - revealing not only consistent increases with repetition but also variation across content types, highlighting PANORAMA's ability to model how memorization risks differ by context. Our dataset and code are publicly available, providing a much-needed resource for privacy risk assessment, model auditing, and the development of privacy-preserving LLMs.

摘要

大型语言模型（LLMs）对敏感信息及个人身份信息（PII）的记忆行为随着模型规模的扩大和在现实应用中的广泛部署，正引发日益严重的隐私风险。当前针对敏感与PII数据记忆的研究及缓解策略开发受限于缺乏全面、真实且符合伦理的数据集，这些数据集应能反映网络环境中敏感信息的多样性特征。为此，我们提出PANORAMA——基于配置文件的自然在线表征与属性记忆分析系统，这是一个包含384,789个样本的大规模合成语料库，源自9,674个合成配置文件，其设计高度模拟在线环境中自然出现的PII及敏感数据的分布、多样性和上下文特征。我们的数据生成流程首先通过约束选择构建具有内部一致性的多属性人类配置文件，以反映教育程度、健康属性、财务状况等真实世界人口统计特征。结合零样本提示技术和OpenAI o3-mini模型，我们生成了多样化的内容类型（包括维基式文章、社交媒体帖子、论坛讨论、在线评论及市场交易列表），每种类型均嵌入符合上下文语境的真实PII及其他敏感信息。通过使用数据子集以1倍、5倍、10倍和25倍复制率对Mistral-7B模型进行微调，并测量PII记忆率，我们验证了PANORAMA的实用性——实验不仅显示记忆率随重复次数持续增长，还揭示了不同内容类型间的记忆差异，凸显了PANORAMA在建模上下文相关记忆风险方面的能力。本数据集与代码已公开，为隐私风险评估、模型审计及隐私保护型LLM的开发提供了亟需的资源。

LAMeTA: Intent-Aware Agentic Network Optimization via a Large AI Model-Empowered Two-Stage Approach

Abstract

arXiv:2505.12247v1 Announce Type: cross Abstract: Nowadays, Generative AI (GenAI) reshapes numerous domains by enabling machines to create content across modalities. As GenAI evolves into autonomous agents capable of reasoning, collaboration, and interaction, they are increasingly deployed on network infrastructures to serve humans automatically. This emerging paradigm, known as the agentic network, presents new optimization challenges due to the demand to incorporate subjective intents of human users expressed in natural language. Traditional generic Deep Reinforcement Learning (DRL) struggles to capture intent semantics and adjust policies dynamically, thus leading to suboptimality. In this paper, we present LAMeTA, a Large AI Model (LAM)-empowered Two-stage Approach for intent-aware agentic network optimization. First, we propose Intent-oriented Knowledge Distillation (IoKD), which efficiently distills intent-understanding capabilities from resource-intensive LAMs to lightweight edge LAMs (E-LAMs) to serve end users. Second, we develop Symbiotic Reinforcement Learning (SRL), integrating E-LAMs with a policy-based DRL framework. In SRL, E-LAMs translate natural language user intents into structured preference vectors that guide both state representation and reward design. The DRL, in turn, optimizes the generative service function chain composition and E-LAM selection based on real-time network conditions, thus optimizing the subjective Quality-of-Experience (QoE). Extensive experiments conducted in an agentic network with 81 agents demonstrate that IoKD reduces mean squared error in intent prediction by up to 22.5%, while SRL outperforms conventional generic DRL by up to 23.5% in maximizing intent-aware QoE.

摘要

当前，生成式人工智能（GenAI）通过使机器能够跨模态生成内容，正在重塑众多领域。随着GenAI进化为具备推理、协作与交互能力的自主智能体，它们被日益部署于网络基础设施中以实现自动化人类服务。这一新兴范式——智能体网络——由于需要融合人类用户以自然语言表达的主观意图，带来了新的优化挑战。传统通用深度强化学习（DRL）难以捕捉意图语义并动态调整策略，从而导致次优结果。本文提出LAMeTA，一种基于大型AI模型（LAM）的两阶段方法，用于意图感知的智能体网络优化。首先，我们设计意图导向知识蒸馏（IoKD），将资源密集型LAM的意图理解能力高效蒸馏至轻量级边缘LAM（E-LAM），以服务终端用户。其次，我们开发共生强化学习（SRL），将E-LAM与基于策略的DRL框架相集成。在SRL中，E-LAM将自然语言用户意图转化为结构化偏好向量，用以指导状态表征和奖励设计；而DRL则根据实时网络条件优化生成式服务功能链组合与E-LAM选择，从而优化主观体验质量（QoE）。在包含81个智能体的网络中进行的大规模实验表明：IoKD将意图预测的均方误差降低达22.5%，而SRL在最大化意图感知QoE方面较传统通用DRL提升达23.5%。

Enhance Mobile Agents Thinking Process Via Iterative Preference Learning

Abstract

arXiv:2505.12299v1 Announce Type: cross Abstract: The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRM). To address these problems, we propose Iterative Preference Learning (IPL), which constructs a CoaT-tree through iterative sampling, scores leaf nodes using a rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q&A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard Mobile GUI-agent benchmarks demonstrate that our agent MobileIPL outperforms strong baselines, including continual pretraining models such as OS-ATLAS and UI-TARS. It achieves state-of-the-art performance on all three benchmarks and shows strong generalization to out-of-domain scenarios.

摘要

行动规划思维链（CoaT）范式已被证明能提升基于视觉语言模型的移动智能体在图形用户界面任务中的推理性能。然而，现有CoaT轨迹的多样性不足限制了此类智能体的表达能力和泛化性能。尽管自训练常被用于缓解数据稀缺问题，现有方法要么忽视中间推理步骤的正确性，要么依赖高成本的过程级标注来构建过程奖励模型（PRM）。针对上述问题，我们提出迭代偏好学习（IPL）方法：通过迭代采样构建CoaT树，利用基于规则的奖励对叶节点评分，并通过反馈回传生成思维级直接偏好优化（T-DPO）配对。为防止监督微调预热阶段的过拟合，我们进一步设计三阶段指令进化策略，借助GPT-4o基于真实移动界面截图生成多样化问答对，同时增强通用性和布局理解能力。在三个标准移动GUI智能体基准上的实验表明，我们的MobileIPL智能体超越了包括持续预训练模型OS-ATLAS和UI-TARS在内的强基线，在全部基准测试中取得最优性能，并展现出优异的跨领域泛化能力。

The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models

Abstract

arXiv:2505.12287v1 Announce Type: cross Abstract: Large language models (LLMs) have seen widespread applications across various domains, yet remain vulnerable to adversarial prompt injections. While most existing research on jailbreak attacks and hallucination phenomena has focused primarily on open-source models, we investigate the frontier of closed-source LLMs under multilingual attack scenarios. We present a first-of-its-kind integrated adversarial framework that leverages diverse attack techniques to systematically evaluate frontier proprietary solutions, including GPT-4o, DeepSeek-R1, Gemini-1.5-Pro, and Qwen-Max. Our evaluation spans six categories of security contents in both English and Chinese, generating 38,400 responses across 32 types of jailbreak attacks. Attack success rate (ASR) is utilized as the quantitative metric to assess performance from three dimensions: prompt design, model architecture, and language environment. Our findings suggest that Qwen-Max is the most vulnerable, while GPT-4o shows the strongest defense. Notably, prompts in Chinese consistently yield higher ASRs than their English counterparts, and our novel Two-Sides attack technique proves to be the most effective across all models. This work highlights a dire need for language-aware alignment and robust cross-lingual defenses in LLMs, and we hope it will inspire researchers, developers, and policymakers toward more robust and inclusive AI systems.

摘要

大语言模型（LLMs）已在多个领域得到广泛应用，但仍易受对抗性提示注入攻击。尽管现有关于越狱攻击和幻觉现象的研究主要集中于开源模型，本研究首次针对闭源大语言模型在多语言攻击场景下的表现展开探索。我们提出了一种首创的集成对抗框架，通过融合多种攻击技术，系统评估了包括GPT-4o、DeepSeek-R1、Gemini-1.5-Pro和Qwen-Max在内的前沿商业解决方案。评估涵盖中英文六类安全内容，在32种越狱攻击类型下生成38,400条响应。采用攻击成功率（ASR）作为量化指标，从提示设计、模型架构和语言环境三个维度进行评估。研究发现Qwen-Max防御最薄弱，而GPT-4o表现出最强的防御能力。值得注意的是，中文提示的ASR始终高于英文对应项，且我们提出的双面攻击技术在所有模型中均表现最优。这项工作揭示了大语言模型亟需具备语言感知对齐能力和强大的跨语言防御机制，希望其能激励研究者、开发者和政策制定者共同构建更鲁棒且包容的人工智能系统。
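
Attack success rate (ASR), the metric used above, is simply the fraction of attack attempts judged successful within a group. The sketch below computes it per group from a list of records; the record layout (`model`, `language`, `success` keys) is a hypothetical choice for illustration, and the values in any usage example are toy data, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def attack_success_rate(records: Iterable[Mapping[str, object]], key: str) -> Dict[str, float]:
    """Group records by `key` and return successes / attempts for each group."""
    successes: Dict[str, int] = defaultdict(int)
    attempts: Dict[str, int] = defaultdict(int)
    for record in records:
        group = str(record[key])
        attempts[group] += 1
        if record["success"]:
            successes[group] += 1
    return {group: successes[group] / attempts[group] for group in attempts}
```

Calling the function once with `key="model"` and once with `key="language"` gives the two views (by model and by language environment) that the abstract compares.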

Wisdom from Diversity: Bias Mitigation Through Hybrid Human-LLM Crowds

Abstract

arXiv:2505.12349v1 Announce Type: cross Abstract: Despite their performance, large language models (LLMs) can inadvertently perpetuate biases found in the data they are trained on. By analyzing LLM responses to bias-eliciting headlines, we find that these models often mirror human biases. To address this, we explore crowd-based strategies for mitigating bias through response aggregation. We first demonstrate that simply averaging responses from multiple LLMs, intended to leverage the "wisdom of the crowd", can exacerbate existing biases due to the limited diversity within LLM crowds. In contrast, we show that locally weighted aggregation methods more effectively leverage the wisdom of the LLM crowd, achieving both bias mitigation and improved accuracy. Finally, recognizing the complementary strengths of LLMs (accuracy) and humans (diversity), we demonstrate that hybrid crowds containing both significantly enhance performance and further reduce biases across ethnic and gender-related contexts.

摘要

尽管大型语言模型（LLMs）性能卓越，但它们可能无意中延续训练数据中存在的偏见。通过分析LLMs对偏见诱发标题的响应，我们发现这些模型常常反映出人类偏见。为解决这一问题，我们探索了基于群体智慧的响应聚合策略来缓解偏见。首先证明，单纯平均多个LLMs的响应（旨在利用"群体智慧"）反而会加剧现有偏见，这是由于LLM群体内部多样性有限所致。相比之下，我们表明局部加权聚合方法能更有效地利用LLM群体的智慧，同时实现偏见缓解和准确率提升。最后，认识到LLMs（准确性）与人类（多样性）的互补优势，我们证明包含两者的混合群体能显著提升性能，并进一步减少涉及种族和性别相关语境中的偏见。

Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models

Abstract

arXiv:2505.12343v1 Announce Type: cross Abstract: Despite the impressive capabilities of Large Vision-Language Models (LVLMs), they remain susceptible to hallucinations, that is, generating content that is inconsistent with the input image. Existing training-free hallucination mitigation methods often suffer from unstable performance and high sensitivity to hyperparameter settings, limiting their practicality and broader adoption. In this paper, we propose a novel decoding mechanism, Decoding with Inter-layer Consistency via Layer Aggregation (DCLA), which requires no retraining, fine-tuning, or access to external knowledge bases. Specifically, our approach constructs a dynamic semantic reference by aggregating representations from previous layers, and corrects semantically deviated layers to enforce inter-layer consistency. This allows DCLA to robustly mitigate hallucinations across multiple LVLMs. Experiments on hallucination benchmarks such as MME and POPE demonstrate that DCLA effectively reduces hallucinations while enhancing the reliability and performance of LVLMs.

摘要

尽管大型视觉语言模型（LVLMs）展现出令人印象深刻的能力，它们仍容易产生幻觉——生成与输入图像不一致的内容。现有的免训练幻觉缓解方法往往存在性能不稳定和对超参数设置高度敏感的问题，这限制了其实用性和广泛采用。本文提出了一种新颖的解码机制——通过层聚合实现层间一致性的解码（DCLA），该方法无需重新训练、微调或访问外部知识库。具体而言，我们的方法通过聚合先前层的表征构建动态语义参考，并校正语义偏离层以强化层间一致性。该机制使DCLA能够稳健地缓解多种LVLMs的幻觉现象。在MME和POPE等幻觉基准测试上的实验表明，DCLA在提升LVLMs可靠性和性能的同时，有效减少了幻觉现象。

CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement

Abstract

arXiv:2505.12368v1 Announce Type: cross Abstract: Prompt injection remains a major security risk for large language models. However, the efficacy of existing guardrail models in context-aware settings remains underexplored, as they often rely on static attack benchmarks. Additionally, they have over-defense tendencies. We introduce CAPTURE, a novel context-aware benchmark assessing both attack detection and over-defense tendencies with minimal in-domain examples. Our experiments reveal that current prompt injection guardrail models suffer from high false negatives in adversarial cases and excessive false positives in benign scenarios, highlighting critical limitations.

摘要

提示注入仍然是大型语言模型面临的主要安全风险。然而现有防护模型在上下文感知环境中的有效性尚未得到充分研究，这些模型往往依赖于静态攻击基准测试。此外，它们还存在过度防御倾向。我们提出了CAPTURE这一新型上下文感知基准，该基准仅需少量域内示例即可同时评估攻击检测能力和过度防御倾向。实验表明，当前提示注入防护模型在对抗性案例中存在高假阴性率，在良性场景中又表现出过高的假阳性率，这些发现揭示了其关键局限性。

Graph-Reward-SQL: Execution-Free Reinforcement Learning for Text-to-SQL via Graph Matching and Stepwise Reward

Abstract

arXiv:2505.12380v1 Announce Type: cross Abstract: Reinforcement learning (RL) has been widely adopted to enhance the performance of large language models (LLMs) on Text-to-SQL tasks. However, existing methods often rely on execution-based or LLM-based Bradley-Terry reward models. The former suffers from high execution latency caused by repeated database calls, whereas the latter imposes substantial GPU memory overhead, both of which significantly hinder the efficiency and scalability of RL pipelines. To this end, we propose a novel Text-to-SQL RL fine-tuning framework named Graph-Reward-SQL, which employs the GMNScore outcome reward model. We leverage SQL graph representations to provide accurate reward signals while significantly reducing inference time and GPU memory usage. Building on this foundation, we further introduce StepRTM, a stepwise reward model that provides intermediate supervision over Common Table Expression (CTE) subqueries. This encourages both functional correctness and structural clarity of SQL. Extensive comparative and ablation experiments on standard benchmarks, including Spider and BIRD, demonstrate that our method consistently outperforms existing reward models.

摘要

强化学习（RL）已被广泛用于提升大语言模型（LLMs）在文本到SQL任务中的表现。然而，现有方法通常依赖于基于执行或基于LLM的Bradley-Terry奖励模型。前者因重复的数据库调用导致高执行延迟，而后者则带来显著的GPU内存开销，这两者都严重影响了RL流程的效率和可扩展性。为此，我们提出了一种名为Graph-Reward-SQL的新型文本到SQL RL微调框架，该框架采用GMNScore结果奖励模型。我们利用SQL图表示来提供准确的奖励信号，同时显著减少推理时间和GPU内存使用。在此基础上，我们进一步引入了StepRTM，一种逐步奖励模型，用于对公共表表达式（CTE）子查询提供中间监督。这既鼓励了SQL的功能正确性，也提升了其结构的清晰性。在Spider和BIRD等标准基准上进行的大量对比和消融实验表明，我们的方法始终优于现有的奖励模型。

From n-gram to Attention: How Model Architectures Learn and Propagate Bias in Language Modeling

Abstract

arXiv:2505.12381v1 Announce Type: cross Abstract: Current research on bias in language models (LMs) predominantly focuses on data quality, with significantly less attention paid to model architecture and temporal influences of data. Even more critically, few studies systematically investigate the origins of bias. We propose a methodology grounded in comparative behavioral theory to interpret the complex interaction between training data and model architecture in bias propagation during language modeling. Building on recent work that relates transformers to n-gram LMs, we evaluate how data, model design choices, and temporal dynamics affect bias propagation. Our findings reveal that: (1) n-gram LMs are highly sensitive to context window size in bias propagation, while transformers demonstrate architectural robustness; (2) the temporal provenance of training data significantly affects bias; and (3) different model architectures respond differentially to controlled bias injection, with certain biases (e.g. sexual orientation) being disproportionately amplified. As language models become ubiquitous, our findings highlight the need for a holistic approach -- tracing bias to its origins across both data and model dimensions, not just symptoms, to mitigate harm.

摘要

当前关于语言模型（LM）偏见的研究主要集中于数据质量，而对模型架构和数据时序影响的关注显著不足。更为关键的是，系统探究偏见起源的研究极为有限。我们提出一种基于比较行为理论的方法论，用以解释语言建模过程中训练数据与模型架构在偏见传播中的复杂交互作用。基于近期将Transformer与n元语法语言模型相关联的研究，我们评估了数据、模型设计选择及时序动态如何影响偏见传播。研究发现：(1) n元语法模型对上下文窗口大小在偏见传播中表现出高度敏感性，而Transformer则展现架构鲁棒性；(2) 训练数据的时序来源显著影响偏见程度；(3) 不同模型架构对受控偏见注入的响应存在差异，某些偏见（如性取向）会被不成比例地放大。随着语言模型的普及，我们的研究结果强调需要采取整体性方法——从数据和模型双重维度追溯偏见根源而非仅关注表象，以此减轻潜在危害。

Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts

Abstract

arXiv:2505.12363v1 Announce Type: cross Abstract: While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition - reasoning about spatial layouts, relations, and dynamics - remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.

摘要

虽然多模态大语言模型（MLLMs）在通用视觉语言任务中表现出色，但视觉空间认知——即对空间布局、关系和动态的推理——仍然是一个重大挑战。现有模型通常缺乏实现细粒度空间理解所需的架构组件和专门训练数据。我们提出了ViCA2（视觉空间认知助手2），这是一种新型MLLM，旨在增强空间推理能力。ViCA2采用双视觉编码器架构，整合了用于语义理解的SigLIP和用于空间结构的Hiera，并辅以令牌比例控制机制提升效率。我们还开发了ViCA-322K数据集，该大规模数据集包含超过322,000个基于空间定位的问答对，用于针对性指令微调。在具有挑战性的VSI-Bench基准测试中，我们的ViCA2-7B模型以56.8的平均分达到最先进水平，显著超越更大的开源模型（如LLaVA-NeXT-Video-72B，40.9）和领先的专有模型（Gemini-1.5 Pro，45.4）。这证明了我们方法在紧凑模型上实现强大视觉空间智能的有效性。我们公开了ViCA2、其代码库及ViCA-322K数据集以促进进一步研究。

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Abstract

arXiv:2505.12366v1 Announce Type: cross Abstract: The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. The main differences between DisCO and GRPO and its recent variants are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions; (3) it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint, ensuring stable training. As a result, DisCO offers notable advantages over GRPO and its variants: (i) it completely eliminates difficulty bias by adopting discriminative objectives; (ii) it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach; (iii) it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training. Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model.

摘要

DeepSeek-R1近期的成功与开源使群体相对策略优化（GRPO）作为大型推理模型（LRMs）的强化学习方法受到广泛关注。本研究在二元奖励设定下分析了GRPO目标函数，揭示了其存在的题目难度偏差固有局限，并发现了GRPO与传统监督学习中判别式方法的关联。基于这些发现，我们提出了一种基于判别学习原理的新框架——判别式约束优化（DisCO），用于增强LRMs。DisCO与GRPO及其近期变体的主要区别在于：（1）用评分函数定义的判别式目标取代群体相对目标；（2）放弃基于剪裁的替代目标，转而采用作为评分函数的非剪裁强化学习替代目标；（3）通过简单有效的约束优化方法实施KL散度约束，确保训练稳定性。因此，DisCO相比GRPO及其变体具有显著优势：（i）采用判别式目标彻底消除了难度偏差；（ii）通过非剪裁评分函数和约束优化方法解决了GRPO及其变体的熵不稳定问题；（iii）可整合先进的判别学习技术以应对数据不平衡问题——训练过程中大量题目生成的负面答案多于正面答案。在提升SFT微调模型数学推理能力的实验中，DisCO显著优于GRPO及其改进变体（如DAPO），在1.5B模型的六项基准任务上平均分别取得7%和6%的性能提升。
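
To make the binary-reward setting concrete, the sketch below computes group-normalized advantages in the commonly used GRPO form, (r - mean) / std, for one group of sampled answers. It is an illustration of how the per-answer weight depends on a question's pass rate, not a reproduction of the paper's analysis or of DisCO; the function name and the zero-variance handling are assumptions.

```python
import math


def grpo_advantages(rewards):
    """Group-normalized advantages (r - mean) / std for one question.

    rewards: list of 0/1 outcomes for the answers sampled for a question.
    Returns all zeros when every answer has the same reward (assumed
    convention for the zero-variance case).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


# A hard question (1 of 8 correct) versus a medium one (4 of 8 correct).
hard = grpo_advantages([1, 0, 0, 0, 0, 0, 0, 0])
medium = grpo_advantages([1, 1, 1, 1, 0, 0, 0, 0])
```

With a pass rate p, a correct answer receives sqrt((1 - p) / p) and an incorrect one receives -sqrt(p / (1 - p)), so the size of the learning signal varies with question difficulty even though the rewards themselves are only 0 or 1.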

Traversal Verification for Speculative Tree Decoding

Abstract

arXiv:2505.12398v1 Announce Type: cross Abstract: Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in parallel to determine whether the drafted tokens should be accepted or rejected. To enhance acceptance rates, existing frameworks typically construct token trees containing multiple candidates in each timestep. However, their reliance on token-level verification mechanisms introduces two critical limitations: First, the probability distribution of a sequence differs from that of individual tokens, leading to suboptimal acceptance length. Second, current verification schemes begin from the root node and proceed layer by layer in a top-down manner. Once a parent node is rejected, all its child nodes should be discarded, resulting in inefficient utilization of speculative candidates. This paper introduces Traversal Verification, a novel speculative decoding algorithm that fundamentally rethinks the verification paradigm through leaf-to-root traversal. Our approach considers the acceptance of the entire token sequence from the current node to the root, and preserves potentially valid subsequences that would be prematurely discarded by existing methods. We theoretically prove that the probability distribution obtained through Traversal Verification is identical to that of the target model, guaranteeing lossless inference while achieving substantial acceleration gains. Experimental results across different large language models and multiple tasks show that our method consistently improves acceptance length and throughput over existing methods.

摘要

推测解码是一种加速大语言模型的有效方法。其核心思想是使用轻量级草稿模型推测目标模型在多个后续时间步的输出，并通过并行验证来决定是否接受或拒绝推测生成的标记。为提高接受率，现有框架通常构建包含每个时间步多个候选的标记树。然而，这些方法依赖标记级验证机制存在两个关键局限：首先，序列的概率分布与单个标记不同，导致接受长度不理想；其次，当前验证方案从根节点开始自上而下逐层进行，一旦父节点被拒绝，其所有子节点都将被丢弃，造成推测候选的低效利用。本文提出遍历验证算法，通过从叶节点到根节点的遍历方式彻底重构了验证范式。该方法综合考虑当前节点到根节点整个标记序列的接受情况，保留可能被现有方法过早丢弃的有效子序列。我们理论证明遍历验证获得的概率分布与目标模型完全一致，在实现显著加速的同时保证无损推理。跨不同大语言模型和多任务的实验结果表明，本方法在接受长度和吞吐量上均优于现有方法。
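
For reference, the token-level, top-down verification that this abstract criticizes can be sketched for a single draft chain as follows. This uses the standard speculative sampling acceptance rule, accepting a drafted token with probability min(1, p/q), and is not the Traversal Verification algorithm. The function name and argument layout are assumptions, and the step that resamples a replacement token after a rejection is omitted for brevity.

```python
import random


def verify_chain(draft_tokens, draft_probs, target_probs, rng):
    """Return the accepted prefix of a drafted token chain.

    draft_probs[i]  : probability q the draft model gave token i (must be > 0).
    target_probs[i] : probability p the target model gives the same token.
    rng             : a random.Random instance, passed in for reproducibility.
    """
    accepted = []
    for token, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Token-level rule: accept with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            accepted.append(token)
        else:
            # Top-down verification: the first rejection discards every
            # later token, even ones the target model would have liked.
            break
    return accepted


# The second token has zero target probability, so "c" is never examined.
prefix = verify_chain(["a", "b", "c"], [0.5, 0.5, 0.5], [0.6, 0.0, 0.9],
                      random.Random(0))
```

In the example, the third token has a high target probability but is dropped anyway because its parent was rejected, which is the inefficiency the abstract describes.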

Table-R1: Region-based Reinforcement Learning for Table Understanding

Abstract

arXiv:2505.12415v1 Announce Type: cross Abstract: Tables present unique challenges for language models due to their structured row-column interactions, necessitating specialized approaches for effective comprehension. While large language models (LLMs) have demonstrated potential in table reasoning through prompting and techniques like chain-of-thought (CoT) and program-of-thought (PoT), optimizing their performance for table question answering remains underexplored. In this paper, we introduce region-based Table-R1, a novel reinforcement learning approach that enhances LLM table understanding by integrating region evidence into reasoning steps. Our method employs Region-Enhanced Supervised Fine-Tuning (RE-SFT) to guide models in identifying relevant table regions before generating answers, incorporating textual, symbolic, and program-based reasoning. Additionally, Table-Aware Group Relative Policy Optimization (TARPO) introduces a mixed reward system to dynamically balance region accuracy and answer correctness, with decaying region rewards and consistency penalties to align reasoning steps. Experiments show that Table-R1 achieves an average performance improvement of 14.36 points across multiple base models on three benchmark datasets, even outperforming baseline models with ten times the parameters, while TARPO reduces response token consumption by 67.5% compared to GRPO, significantly advancing LLM capabilities in efficient tabular reasoning.

摘要

表格因其行列交互的结构化特性，对语言模型提出了独特挑战，需要采用专门方法以实现有效理解。尽管大型语言模型（LLM）通过提示技术（如思维链CoT和程序思维PoT）在表格推理中展现出潜力，但针对表格问答任务的性能优化仍待深入探索。本文提出基于区域的Table-R1方法——一种通过将区域证据整合至推理步骤来增强LLM表格理解能力的新型强化学习框架。该方法采用区域增强监督微调（RE-SFT），指导模型在生成答案前识别相关表格区域，并融合文本、符号和程序化推理。进一步地，表格感知分组相对策略优化（TARPO）引入混合奖励机制，通过衰减区域奖励和一致性惩罚来动态平衡区域准确性与答案正确性，从而对齐推理步骤。实验表明，Table-R1在三个基准数据集上对多个基础模型平均提升14.36个性能点，其表现甚至超越参数量十倍的基线模型；同时TARPO相较GRPO减少67.5%的响应token消耗，显著提升了LLM在高效表格推理方面的能力。

EvoGPT: Enhancing Test Suite Robustness via LLM-Based Generation and Genetic Optimization

Abstract

arXiv:2505.12424v1 Announce Type: cross Abstract: Large Language Models (LLMs) have recently emerged as promising tools for automated unit test generation. We introduce a hybrid framework called EvoGPT that integrates LLM-based test generation with evolutionary search techniques to create diverse, fault-revealing unit tests. Unit tests are initially generated with diverse temperature sampling to maximize behavioral and test suite diversity, followed by a generation-repair loop and coverage-guided assertion enhancement. The resulting test suites are evolved using genetic algorithms, guided by a fitness function prioritizing mutation score over traditional coverage metrics. This design emphasizes the primary objective of unit testing-fault detection. Evaluated on multiple open-source Java projects, EvoGPT achieves an average improvement of 10% in both code coverage and mutation score compared to LLMs and traditional search-based software testing baselines. These results demonstrate that combining LLM-driven diversity, targeted repair, and evolutionary optimization produces more effective and resilient test suites.

摘要

大语言模型（LLMs）近期成为自动化单元测试生成的有力工具。我们提出一种名为EvoGPT的混合框架，将基于LLM的测试生成与进化搜索技术相结合，以创建多样化且能揭示缺陷的单元测试。该框架首先生成具有多样温度采样的初始测试用例，最大化行为多样性与测试套件多样性，随后通过生成-修复循环和覆盖率引导的断言增强进行优化。最终测试套件采用遗传算法进化，其适应度函数优先考虑变异分数而非传统覆盖率指标，从而突出单元测试的核心目标——缺陷检测。在多个开源Java项目上的评估表明，相较于单纯使用LLM或传统基于搜索的软件测试基准方法，EvoGPT平均提升10%的代码覆盖率和变异分数。这些结果证明，结合LLM驱动的多样性生成、定向修复与进化优化，能够产生更有效且鲁棒的测试套件。

PSC: Extending Context Window of Large Language Models via Phase Shift Calibration

Abstract

arXiv:2505.12423v1 Announce Type: cross Abstract: Rotary Position Embedding (RoPE) is an efficient position encoding approach and is widely utilized in numerous large language models (LLMs). Recently, a lot of methods have been put forward to further expand the context window based on RoPE. The core concept of those methods is to predefine or search for a set of factors to rescale the base frequencies of RoPE. Nevertheless, it is quite a challenge for existing methods to predefine an optimal factor due to the exponential search space. In view of this, we introduce PSC (Phase Shift Calibration), a small module for calibrating the frequencies predefined by existing methods. With the employment of PSC, we demonstrate that many existing methods can be further enhanced, like PI, YaRN, and LongRoPE. We conducted extensive experiments across multiple models and tasks. The results demonstrate that (1) when PSC is enabled, the comparative reductions in perplexity increase as the context window size is varied from 16k, to 32k, and up to 64k. (2) Our approach is broadly applicable and exhibits robustness across a variety of models and tasks. The code can be found at https://github.com/WNQzhu/PSC.

摘要

旋转位置编码（RoPE）作为一种高效的位置编码方法，已被广泛应用于众多大语言模型（LLM）中。近期，大量研究基于RoPE提出了进一步扩展上下文窗口的方案，其核心思想是通过预定义或搜索一组缩放因子来调整RoPE的基频。然而，由于指数级搜索空间的存在，现有方法难以预定义最优缩放因子。针对此问题，我们提出相位偏移校准模块（PSC），用于对现有方法预定义的频率进行校准。实验表明，采用PSC可显著提升PI、YaRN、LongRoPE等现有方法的性能。我们在多模型多任务场景下开展广泛实验，结果显示：（1）启用PSC时，当上下文窗口从16k逐步扩展至32k和64k，困惑度的相对改善幅度持续增大；（2）该方法具有广泛适用性，在不同模型和任务中均表现出稳健性。代码已开源：https://github.com/WNQzhu/PSC。
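
The "factors to rescale the base frequencies" mentioned in this abstract can be illustrated with the standard RoPE frequency definition, theta_i = base^(-2i/d). The sketch below shows only that rescaling step; it does not implement PSC. The helper names are hypothetical, and 10000 is used as the commonly seen default base.

```python
def rope_frequencies(dim, base=10000.0):
    """Per-pair rotary frequencies theta_i = base ** (-2i / dim)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rescale_frequencies(freqs, factors):
    """Divide each frequency by its scaling factor.

    A uniform factor s corresponds to linear position interpolation,
    which stretches a context window of length L to s * L. Methods that
    use per-dimension factors pass a different value for each entry.
    """
    return [f / s for f, s in zip(freqs, factors)]


freqs = rope_frequencies(8)
# Uniform factor of 4, e.g. extending a 16k window toward 64k.
interpolated = rescale_frequencies(freqs, [4.0] * len(freqs))
```

Because each of the dim / 2 frequencies can take its own factor, the space of candidate factor sets grows exponentially with the number of dimensions, which is the search difficulty the abstract refers to.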

SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment

Abstract

arXiv:2505.12435v1 Announce Type: cross Abstract: Direct Preference Optimization (DPO) is broadly utilized for aligning Large Language Models (LLMs) with human values because of its flexibility. Despite its effectiveness, it has been observed that the capability of DPO to generate human-preferred response is limited and the results of DPO are far from resilient. To address these limitations, in this paper we propose a novel Self-Guided Direct Preference Optimization algorithm, i.e., SGDPO, which incorporates a pilot term to steer the gradient flow during the optimization process, allowing for fine-grained control over the updates of chosen and rejected rewards. We provide a detailed theoretical analysis of our proposed method and elucidate its operational mechanism. Furthermore, we conduct comprehensive experiments on various models and benchmarks. The extensive experimental results demonstrate the consistency between the empirical results and our theoretical analysis and confirm the effectiveness of our proposed approach (up to 9.19% higher score).

摘要

直接偏好优化（DPO）因其灵活性被广泛应用于将大语言模型（LLM）与人类价值观对齐。尽管其效果显著，但研究发现DPO生成人类偏好响应的能力有限，且结果远非稳健。为应对这些局限性，本文提出一种新型自引导直接偏好优化算法SGDPO，该算法通过引入引导项来调控优化过程中的梯度流向，从而实现对选中奖励与拒绝奖励更新的细粒度控制。我们提供了所提方法的详细理论分析，并阐明其运作机制。此外，我们在多种模型和基准测试上进行了全面实验。大量实验结果表明，实证结果与理论分析具有一致性，并验证了所提方法的有效性（最高可获得9.19%的性能提升）。
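
The "chosen and rejected rewards" in this abstract are the implicit rewards of the standard DPO objective, which the sketch below computes for a single preference pair from sequence log-probabilities. This is plain DPO only: the pilot term that SGDPO adds is not described in enough detail here to reproduce, so it is left out. The function name and the default beta of 0.1 are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is a summed log-probability of a full response under the
    policy (logp_*) or the frozen reference model (ref_logp_*).
    Returns (loss, chosen_reward, rejected_reward).
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward


# Policy equal to the reference: zero margin, loss of log(2).
loss_at_init, _, _ = dpo_loss(-12.0, -15.0, -12.0, -15.0)
```

The loss depends only on the margin between the two rewards, so it can fall while both rewards move in the same direction; finer control over how each reward is updated is the stated aim of the method.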

SRLoRA: Subspace Recomposition in Low-Rank Adaptation via Importance-Based Fusion and Reinitialization

Abstract

arXiv:2505.12433v1 Announce Type: cross Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method that injects two trainable low-rank matrices (A and B) into frozen pretrained models. While efficient, LoRA constrains updates to a fixed low-rank subspace (Delta W = BA), which can limit representational capacity and hinder downstream performance. We introduce Subspace Recomposition in Low-Rank Adaptation (SRLoRA) via importance-based fusion and reinitialization, a novel approach that enhances LoRA's expressiveness without compromising its lightweight structure. SRLoRA assigns importance scores to each LoRA pair (a column of B and the corresponding row of A), and dynamically recomposes the subspace during training. Less important pairs are fused into the frozen backbone, freeing capacity to reinitialize new pairs along unused principal directions derived from the pretrained weight's singular value decomposition. This mechanism enables continual subspace refreshment and richer adaptation over time, without increasing the number of trainable parameters. We evaluate SRLoRA on both language and vision tasks, including the GLUE benchmark and various image classification datasets. SRLoRA consistently achieves faster convergence and improved accuracy over standard LoRA, demonstrating its generality, efficiency, and potential for broader PEFT applications.

摘要

低秩适配（LoRA）是一种广泛采用的参数高效微调（PEFT）方法，其通过向冻结的预训练模型中注入两个可训练的低秩矩阵（A和B）来实现。尽管高效，LoRA将更新约束在固定的低秩子空间（Delta W = BA）内，这可能限制其表示能力并影响下游性能。我们提出了一种基于重要性融合与重新初始化的低秩适配子空间重组方法（SRLoRA），该方法在不牺牲LoRA轻量级结构的前提下增强了其表达能力。SRLoRA为每对LoRA矩阵（B的列与A的对应行）分配重要性分数，并在训练过程中动态重组子空间。重要性较低的矩阵对被融合到冻结的主干网络中，从而释放容量以沿预训练权重奇异值分解得到的未使用主方向重新初始化新的矩阵对。这一机制实现了子空间的持续更新和更丰富的适配过程，且不增加可训练参数数量。我们在语言和视觉任务（包括GLUE基准测试和多种图像分类数据集）上评估了SRLoRA。实验表明，SRLoRA在标准LoRA基础上始终实现更快的收敛速度和更高的准确率，验证了其通用性、高效性以及在更广泛PEFT应用中的潜力。
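
The fusion step in this abstract rests on a simple identity: since Delta W = BA is a sum of rank-one terms, one per (column of B, row of A) pair, any single pair can be added into the frozen weight and then cleared without changing the effective weight W + BA. The sketch below demonstrates only that identity with plain nested lists. The helper names are hypothetical, the importance scoring is not shown, and the freed pair is set to zero rather than reinitialized along SVD directions as the paper describes.

```python
def matmul(X, Y):
    """Multiply two matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def fuse_pair(W, B, A, k):
    """Fold LoRA pair k (column k of B, row k of A) into W.

    Returns new (W, B, A) with the pair's rank-one contribution added to W
    and the pair itself zeroed, so that W + B @ A is unchanged.
    """
    delta = [[B[i][k] * A[k][j] for j in range(len(A[0]))]
             for i in range(len(B))]
    W_new = add(W, delta)
    B_new = [row[:k] + [0.0] + row[k + 1:] for row in B]
    A_new = [row[:] if i != k else [0.0] * len(row) for i, row in enumerate(A)]
    return W_new, B_new, A_new


W = [[1, 2], [3, 4]]
B = [[1, 2], [3, 4]]   # d_out x r, with r = 2
A = [[5, 6], [7, 8]]   # r x d_in
before = add(W, matmul(B, A))
W2, B2, A2 = fuse_pair(W, B, A, 0)
after = add(W2, matmul(B2, A2))
```

Because the effective weight is preserved at the moment of fusion, the freed slot can be reused for a new direction without adding trainable parameters.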

Towards Budget-Friendly Model-Agnostic Explanation Generation for Large Language Models

Abstract

arXiv:2505.12509v1 Announce Type: cross Abstract: With Large language models (LLMs) becoming increasingly prevalent in various applications, the need for interpreting their predictions has become a critical challenge. As LLMs vary in architecture and some are closed-sourced, model-agnostic techniques show great promise without requiring access to the model's internal parameters. However, existing model-agnostic techniques need to invoke LLMs many times to gain sufficient samples for generating faithful explanations, which leads to high economic costs. In this paper, we show that it is practical to generate faithful explanations for large-scale LLMs by sampling from some budget-friendly models through a series of empirical studies. Moreover, we show that such proxy explanations also perform well on downstream tasks. Our analysis provides a new paradigm of model-agnostic explanation methods for LLMs, by including information from budget-friendly models.

摘要

随着大语言模型（LLMs）在各种应用中的日益普及，解释其预测结果的需求已成为关键挑战。由于不同LLMs的架构存在差异且部分模型未开源，与模型无关的技术在不需访问模型内部参数的情况下展现出巨大潜力。然而，现有与模型无关的技术需要多次调用LLMs以获得足够样本生成可信解释，这导致高昂的经济成本。本文通过一系列实证研究表明，通过从某些经济型模型中进行采样，实际可为大规模LLMs生成可信解释。此外，我们发现此类代理解释在下游任务中同样表现良好。我们的分析为LLMs提供了一种新的与模型无关的解释方法范式，即通过纳入经济型模型的信息来实现。

Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning

Abstract

arXiv:2505.12432v1 Announce Type: cross Abstract: Reinforcement Learning (RL) has shown promise in improving the reasoning abilities of Large Language Models (LLMs). However, the specific challenges of adapting RL to multimodal data and formats remain relatively unexplored. In this work, we present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We draw inspirations from human learning progression--from simple to complex and easy to difficult, and propose a gradual learning paradigm for MLLMs. To this end, we construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. To tackle multimodal tasks, we introduce a multimodal format constraint that encourages careful observation of images, resulting in enhanced visual abilities and clearer and more structured responses. Additionally, we implement a bonus reward system that favors concise, correct answers within a length constraint, alongside a dynamic weighting mechanism that prioritizes uncertain and medium-difficulty problems, ensuring that more informative samples have a greater impact on training. Our experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks, achieving superior clarity and conciseness in reasoning chains. Ablation studies validate the effectiveness of our strategies, highlighting the robustness and generalization of our approach. The dataset and code will be released at https://github.com/zrguo/Observe-R1.

摘要

强化学习（RL）在提升大语言模型（LLMs）的推理能力方面展现出潜力。然而，如何将RL适应于多模态数据与格式的具体挑战仍待探索。本研究提出Observe-R1框架，旨在增强多模态大语言模型（MLLMs）的推理能力。受人类'由简入繁、由易到难'学习进程的启发，我们为MLLMs设计了一种渐进式学习范式。为此，我们构建了NeuraLadder数据集，其样本根据RL训练的难度与复杂度进行组织与采样。针对多模态任务，我们引入多模态格式约束机制，促使模型细致观察图像，从而提升视觉能力并生成更清晰、结构化的响应。此外，我们设计了奖励系统：在长度限制下优先奖励简洁正确的答案，同时采用动态权重机制，重点关注不确定性和中等难度问题，确保信息量更大的样本对训练产生更大影响。基于Qwen2.5-VL-3B和Qwen2.5-VL-7B模型在NeuraLadder数据集20k样本上的实验表明，Observe-R1在推理和通用基准测试中均优于一系列更大规模的推理模型，且推理链更清晰简洁。消融实验验证了我们策略的有效性，证明了方法的鲁棒性与泛化能力。数据集与代码将在https://github.com/zrguo/Observe-R1发布。

Enhancing Large Language Models with Reward-guided Tree Search for Knowledge Graph Question and Answering

Abstract

arXiv:2505.12476v1 Announce Type: cross Abstract: Recently, large language models (LLMs) have demonstrated impressive performance in Knowledge Graph Question Answering (KGQA) tasks, which aim to find answers based on knowledge graphs (KGs) for natural language questions. Existing LLMs-based KGQA methods typically follow the Graph Retrieval-Augmented Generation (GraphRAG) paradigm, which first retrieves reasoning paths from the large KGs, and then generates the answers based on them. However, these methods emphasize the exploration of new optimal reasoning paths in KGs while ignoring the exploitation of historical reasoning paths, which may lead to sub-optimal reasoning paths. Additionally, the complex semantics contained in questions may lead to the retrieval of inaccurate reasoning paths. To address these issues, this paper proposes a novel and training-free framework for KGQA tasks called Reward-guided Tree Search on Graph (RTSoG). RTSoG decomposes an original question into a series of simpler and well-defined sub-questions to handle the complex semantics. Then, a Self-Critic Monte Carlo Tree Search (SC-MCTS) guided by a reward model is introduced to iteratively retrieve weighted reasoning paths as contextual knowledge. Finally, it stacks the weighted reasoning paths according to their weights to generate the final answers. Extensive experiments on four datasets demonstrate the effectiveness of RTSoG. Notably, it achieves 8.7% and 7.0% performance improvement over the state-of-the-art method on the GrailQA and the WebQSP respectively.

摘要

近期，大语言模型（LLMs）在知识图谱问答（KGQA）任务中展现出卓越性能，该任务旨在基于知识图谱（KGs）为自然语言问题寻找答案。现有基于LLMs的KGQA方法通常遵循图检索增强生成（GraphRAG）范式，即先从大型KGs中检索推理路径，再基于这些路径生成答案。然而，这些方法侧重于探索KGs中的新最优推理路径，却忽视了对历史推理路径的利用，可能导致次优推理路径的产生。此外，问题中蕴含的复杂语义可能导致检索到不准确的推理路径。针对这些问题，本文提出一种无需训练的新型KGQA框架——基于奖励引导的图树搜索（RTSoG）。RTSoG将原始问题分解为一系列更简单且定义明确的子问题以处理复杂语义，随后引入由奖励模型引导的自批判蒙特卡洛树搜索（SC-MCTS）迭代检索加权推理路径作为上下文知识，最后根据权重堆叠这些路径以生成最终答案。在四个数据集上的大量实验验证了RTSoG的有效性。值得注意的是，其在GrailQA和WebQSP上分别比现有最优方法实现了8.7%和7.0%的性能提升。

IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems

Abstract

arXiv:2505.12442v1 Announce Type: cross Abstract: The rapid advancement of Large Language Models (LLMs) has led to the emergence of Multi-Agent Systems (MAS) to perform complex tasks through collaboration. However, the intricate nature of MAS, including their architecture and agent interactions, raises significant concerns regarding intellectual property (IP) protection. In this paper, we introduce MASLEAK, a novel attack framework designed to extract sensitive information from MAS applications. MASLEAK targets a practical, black-box setting, where the adversary has no prior knowledge of the MAS architecture or agent configurations. The adversary can only interact with the MAS through its public API, submitting attack query $q$ and observing outputs from the final agent. Inspired by how computer worms propagate and infect vulnerable network hosts, MASLEAK carefully crafts adversarial query $q$ to elicit, propagate, and retain responses from each MAS agent that reveal a full set of proprietary components, including the number of agents, system topology, system prompts, task instructions, and tool usages. We construct the first synthetic dataset of MAS applications with 810 applications and also evaluate MASLEAK against real-world MAS applications, including Coze and CrewAI. MASLEAK achieves high accuracy in extracting MAS IP, with an average attack success rate of 87% for system prompts and task instructions, and 92% for system architecture in most cases. We conclude by discussing the implications of our findings and the potential defenses.

摘要

大型语言模型（LLMs）的快速发展促使多智能体系统（MAS）通过协作执行复杂任务。然而，MAS的复杂性（包括其架构和智能体交互）引发了关于知识产权（IP）保护的重要问题。本文提出MASLEAK——一种新型攻击框架，旨在从MAS应用中提取敏感信息。该框架针对实际黑盒场景设计，攻击者无需预先了解MAS架构或智能体配置，仅能通过公共API与系统交互，提交攻击查询 $q$ 并观察最终智能体的输出。受计算机蠕虫传播感染脆弱网络主机的机制启发，MASLEAK精心构造对抗性查询 $q$ ，以逐层诱发、传播并保留来自每个MAS智能体的响应，从而完整揭示包括智能体数量、系统拓扑、系统提示、任务指令及工具使用在内的全套专有组件。我们构建了首个包含810个应用的MAS合成数据集，并在真实MAS应用（包括Coze和CrewAI）上评估MASLEAK。实验表明，该框架在提取MAS知识产权方面具有高准确率：在多数情况下，系统提示和任务指令的平均攻击成功率达87%，系统架构达92%。最后我们讨论了研究发现的潜在影响及可能的防御措施。

CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models

Abstract

arXiv:2505.12504v1 Announce Type: cross Abstract: Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clip mechanism on the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.

摘要

基于规则的强化学习（RL）最新进展显著提升了语言模型（LM）在规则奖励下的推理能力。然而现有RL方法——如GRPO、REINFORCE++和RLOO——常面临训练不稳定的问题，大幅策略更新和不恰当裁剪可能导致训练崩溃。为解决该问题，我们提出带策略漂移约束的裁剪策略梯度优化算法（CPGD），该新型算法旨在稳定语言模型的策略学习。CPGD基于KL散度引入策略漂移约束以动态正则化策略更新，并通过对数比率裁剪机制防止策略过度更新。我们为CPGD提供了理论证明，并通过实证分析表明其有效缓解了现有方法的不稳定性。此外，实验证明CPGD在保持训练稳定性的同时显著提升性能。该实现平衡了理论严谨性与实践可用性，为语言模型微调中的强化学习提供了稳健方案。代码已发布于https://github.com/ModalMinds/MM-EUREKA。

Measuring Information Distortion in Hierarchical Ultra long Novel Generation: The Optimal Expansion Ratio

Abstract

arXiv:2505.12572v1 Announce Type: cross Abstract: Writing novels with Large Language Models (LLMs) raises a critical question: how much human-authored outline is necessary to generate high-quality million-word novels? While frameworks such as DOME, Plan&Write, and Long Writer have improved stylistic coherence and logical consistency, they primarily target shorter novels (10k--100k words), leaving ultra-long generation largely unexplored. Drawing on insights from recent text compression methods like LLMZip and LLM2Vec, we conduct an information-theoretic analysis that quantifies distortion occurring when LLMs compress and reconstruct ultra-long novels under varying compression-expansion ratios. We introduce a hierarchical two-stage generation pipeline (outline -> detailed outline -> manuscript) and find an optimal outline length that balances information preservation with human effort. Through extensive experimentation with Chinese novels, we establish that a two-stage hierarchical outline approach significantly reduces semantic distortion compared to single-stage methods. Our findings provide empirically-grounded guidance for authors and researchers collaborating with LLMs to create million-word novels.

摘要

使用大语言模型（LLMs）创作小说引发了一个关键问题：生成高质量百万字小说需要多少人工撰写的提纲？尽管DOME、Plan&Write和Long Writer等框架已提升了风格连贯性与逻辑一致性，但这些方法主要针对较短篇幅小说（1万至10万字），超长篇生成领域仍待探索。基于LLMZip和LLM2Vec等最新文本压缩方法的启示，我们通过信息论分析量化了LLMs在不同压缩-扩展比率下对超长篇小说进行压缩与重构时产生的失真程度。本文提出分层两阶段生成流程（提纲→详细提纲→手稿），并发现存在一个最优提纲长度能在信息保留与人力投入间取得平衡。通过对中文小说的广泛实验，我们证实相较于单阶段方法，分层两阶段提纲策略能显著降低语义失真。本研究为作者与研究者协作LLMs创作百万字小说提供了基于实证的指导。

A Survey of Attacks on Large Language Models

Abstract

arXiv:2505.12567v1 Announce Type: cross Abstract: Large language models (LLMs) and LLM-based agents have been widely deployed in a wide range of applications in the real world, including healthcare diagnostics, financial analysis, customer support, robotics, and autonomous driving, expanding their powerful capability of understanding, reasoning, and generating natural languages. However, the wide deployment of LLM-based applications exposes critical security and reliability risks, such as the potential for malicious misuse, privacy leakage, and service disruption that weaken user trust and undermine societal safety. This paper provides a systematic overview of the details of adversarial attacks targeting both LLMs and LLM-based agents. These attacks are organized into three phases in LLMs: Training-Phase Attacks, Inference-Phase Attacks, and Availability & Integrity Attacks. For each phase, we analyze the details of representative and recently introduced attack methods along with their corresponding defenses. We hope our survey will provide a good tutorial and a comprehensive understanding of LLM security, especially for attacks on LLMs. We desire to raise attention to the risks inherent in widely deployed LLM-based applications and highlight the urgent need for robust mitigation strategies for evolving threats.

摘要

大语言模型（LLMs）及基于LLM的智能体已在现实世界的广泛应用中部署，包括医疗诊断、金融分析、客户支持、机器人技术和自动驾驶等领域，展现出其在自然语言理解、推理与生成方面的强大能力。然而，基于LLM应用的广泛部署也暴露出关键的安全与可靠性风险，例如恶意滥用、隐私泄露和服务中断等潜在威胁，这些风险削弱了用户信任并危及社会安全。本文系统梳理了针对LLMs及基于LLM智能体的对抗攻击细节，将这些攻击按LLM生命周期划分为三个阶段：训练阶段攻击、推理阶段攻击以及可用性与完整性攻击。针对每个阶段，我们分析了具有代表性及最新提出的攻击方法及其对应防御措施。本研究旨在为LLM安全领域，特别是针对LLMs的攻击研究提供详尽的教程式综述与全面理解。我们呼吁关注广泛部署的LLM应用所固有的风险，并强调针对持续演变的威胁制定鲁棒缓解策略的紧迫性。

AD-AGENT: A Multi-agent Framework for End-to-end Anomaly Detection

Abstract

arXiv:2505.12594v1 Announce Type: cross Abstract: Anomaly detection (AD) is essential in areas such as fraud detection, network monitoring, and scientific research. However, the diversity of data modalities and the increasing number of specialized AD libraries pose challenges for non-expert users who lack in-depth library-specific knowledge and advanced programming skills. To tackle this, we present AD-AGENT, an LLM-driven multi-agent framework that turns natural-language instructions into fully executable AD pipelines. AD-AGENT coordinates specialized agents for intent parsing, data preparation, library and model selection, documentation mining, and iterative code generation and debugging. Using a shared short-term workspace and a long-term cache, the agents integrate popular AD libraries like PyOD, PyGOD, and TSLib into a unified workflow. Experiments demonstrate that AD-AGENT produces reliable scripts and recommends competitive models across libraries. The system is open-sourced to support further research and practical applications in AD.

摘要

异常检测（AD）在欺诈检测、网络监控和科学研究等领域具有重要作用。然而，数据模态的多样性和专业化AD库数量的增加，给缺乏深入库特定知识和高级编程技能的非专家用户带来了挑战。为解决这一问题，我们提出了AD-AGENT——一个基于大语言模型（LLM）的多智能体框架，能够将自然语言指令转换为完全可执行的AD流程。该框架通过协调意图解析、数据准备、库与模型选择、文档挖掘以及迭代式代码生成与调试等专业化智能体，借助共享短期工作空间和长期缓存机制，将PyOD、PyGOD和TSLib等主流AD库整合至统一工作流中。实验表明，AD-AGENT能生成可靠脚本并推荐跨库的竞争力模型。本系统已开源以支持AD领域的进一步研究和实际应用。

Web IP at Risk: Prevent Unauthorized Real-Time Retrieval by Large Language Models

Abstract

arXiv:2505.12655v1 Announce Type: cross Abstract: Protecting cyber Intellectual Property (IP) such as web content is an increasingly critical concern. The rise of large language models (LLMs) with online retrieval capabilities presents a double-edged sword that enables convenient access to information but often undermines the rights of original content creators. As users increasingly rely on LLM-generated responses, they gradually diminish direct engagement with original information sources, significantly reducing the incentives for IP creators to contribute, and leading to a saturating cyberspace with more AI-generated content. In response, we propose a novel defense framework that empowers web content creators to safeguard their web-based IP from unauthorized LLM real-time extraction by leveraging the semantic understanding capability of LLMs themselves. Our method follows principled motivations and effectively addresses an intractable black-box optimization problem. Real-world experiments demonstrated that our methods improve defense success rates from 2.5% to 88.6% on different LLMs, outperforming traditional defenses such as configuration-based restrictions.

摘要

保护网络知识产权（如网页内容）已成为日益关键的问题。具备在线检索能力的大型语言模型（LLMs）的兴起犹如双刃剑：虽便于信息获取，却常损害原创内容创作者权益。随着用户日益依赖LLM生成的回答，其与原始信息源的直接互动逐渐减少，这显著降低了知识产权创造者的贡献动力，导致网络空间逐渐饱和更多AI生成内容。为此，我们提出一种新型防御框架，通过利用LLM自身的语义理解能力，帮助网页内容创作者防范未经授权的LLM实时抓取。该方法遵循原则性动机，并有效解决了棘手的黑盒优化问题。真实场景实验表明，我们的防御方案在不同LLM上将防御成功率从2.5%提升至88.6%，显著优于基于配置限制等传统防御手段。

Know3-RAG: A Knowledge-aware RAG Framework with Adaptive Retrieval, Generation, and Filtering

Abstract

arXiv:2505.12662v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have led to impressive progress in natural language generation, yet their tendency to produce hallucinated or unsubstantiated content remains a critical concern. To improve factual reliability, Retrieval-Augmented Generation (RAG) integrates external knowledge during inference. However, existing RAG systems face two major limitations: (1) unreliable adaptive control due to limited external knowledge supervision, and (2) hallucinations caused by inaccurate or irrelevant references. To address these issues, we propose Know3-RAG, a knowledge-aware RAG framework that leverages structured knowledge from knowledge graphs (KGs) to guide three core stages of the RAG process, including retrieval, generation, and filtering. Specifically, we introduce a knowledge-aware adaptive retrieval module that employs KG embedding to assess the confidence of the generated answer and determine retrieval necessity, a knowledge-enhanced reference generation strategy that enriches queries with KG-derived entities to improve generated reference relevance, and a knowledge-driven reference filtering mechanism that ensures semantic alignment and factual accuracy of references. Experiments on multiple open-domain QA benchmarks demonstrate that Know3-RAG consistently outperforms strong baselines, significantly reducing hallucinations and enhancing answer reliability.

摘要

尽管大语言模型（LLM）在自然语言生成方面取得了显著进展，但其产生幻觉或未经证实内容的倾向仍是一个关键问题。为提高事实可靠性，检索增强生成（RAG）在推理过程中整合了外部知识。然而，现有RAG系统面临两大局限：（1）由于外部知识监督有限导致的自适应控制不可靠；（2）不准确或无关参考引发的幻觉。针对这些问题，我们提出Know3-RAG框架，该知识感知RAG系统利用知识图谱（KG）的结构化知识指导RAG流程的三个核心阶段——检索、生成与过滤。具体而言，我们设计了知识感知自适应检索模块（通过KG嵌入评估生成答案置信度以决定检索必要性）、知识增强参考生成策略（利用KG实体扩展查询以提高生成参考的相关性）以及知识驱动的参考过滤机制（确保参考的语义对齐与事实准确性）。在多个开放域QA基准测试上的实验表明，Know3-RAG始终优于强基线模型，显著减少幻觉并提升答案可靠性。

Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents

Abstract

arXiv:2505.12632v1 Announce Type: cross Abstract: Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing GUI visual agents. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities, consistently outperforming models trained on existing single-OS datasets while achieving an average performance gain of 18.11 percentage points on an unseen mobile OS platform. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework comprises robust OCR-based scene detection (95.04% F1 score), near-perfect UI element detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.

摘要

大型语言模型（LLMs）与视觉语言模型（VLMs）的最新进展引发了开发GUI视觉代理的广泛兴趣。我们提出MONDAY（基于YouTube视频的移动操作系统导航任务代理数据集），这是一个从2万条教学视频中提取的31.3万帧标注数据构成的大规模数据集，涵盖了多平台下多样化的真实世界移动操作系统导航场景。在预训练阶段引入MONDAY的模型展现出强大的跨平台泛化能力，其性能始终优于基于现有单一操作系统数据集训练的模型，并在未见过的移动操作系统平台上实现了18.11%的平均性能提升。为支持数据集随移动平台演进而持续扩展，我们提出一种自动化框架，该框架利用公开视频内容构建无需人工标注的完整任务数据集。我们的框架包含基于OCR的鲁棒场景检测（95.04% F1分数）、接近完美的UI元素检测（99.87%命中率），以及创新的多步骤动作识别技术，可跨多样界面配置提取可靠动作序列。我们同时贡献MONDAY数据集与自动化采集框架，以推动移动操作系统导航领域的未来研究。

EpiLLM: Unlocking the Potential of Large Language Models in Epidemic Forecasting

Abstract

arXiv:2505.12738v1 Announce Type: cross Abstract: Advanced epidemic forecasting is critical for enabling precision containment strategies, highlighting its strategic importance for public health security. While recent advances in Large Language Models (LLMs) have demonstrated effectiveness as foundation models for domain-specific tasks, their potential for epidemic forecasting remains largely unexplored. In this paper, we introduce EpiLLM, a novel LLM-based framework tailored for spatio-temporal epidemic forecasting. Considering the key factors in real-world epidemic transmission: infection cases and human mobility, we introduce a dual-branch architecture to achieve fine-grained token-level alignment between such complex epidemic patterns and language tokens for LLM adaptation. To unleash the multi-step forecasting and generalization potential of LLM architectures, we propose an autoregressive modeling paradigm that reformulates the epidemic forecasting task into next-token prediction. To further enhance LLM perception of epidemics, we introduce spatio-temporal prompt learning techniques, which strengthen forecasting capabilities from a data-driven perspective. Extensive experiments show that EpiLLM significantly outperforms existing baselines on real-world COVID-19 datasets and exhibits scaling behavior characteristic of LLMs.

摘要

先进的疫情预测技术对于实现精准防控策略至关重要，其对公共卫生安全的战略意义日益凸显。尽管近期大语言模型（LLMs）作为领域专用基础模型已展现出卓越性能，但其在疫情预测领域的潜力仍待深入探索。本文提出EpiLLM——一种基于LLM的新型时空疫情预测框架。针对现实疫情传播中的关键因素（感染病例与人口流动），我们设计了双分支架构，通过细粒度令牌级对齐实现复杂疫情模式与语言令牌的适配。为释放LLM架构的多步预测与泛化潜力，我们构建了自回归建模范式，将疫情预测任务重构为下一令牌预测问题。为进一步增强LLM对疫情特征的感知能力，提出了时空提示学习技术，从数据驱动角度强化预测性能。大量实验表明，EpiLLM在真实世界COVID-19数据集上显著超越现有基线，并展现出LLM特有的规模扩展特性。

Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization

Abstract

arXiv:2505.12763v1 Announce Type: cross Abstract: Reward models (RMs) play a crucial role in reinforcement learning from human feedback (RLHF), aligning model behavior with human preferences. However, existing benchmarks for reward models show a weak correlation with the performance of optimized policies, suggesting that they fail to accurately assess the true capabilities of RMs. To bridge this gap, we explore several evaluation designs through the lens of reward overoptimization, a phenomenon that captures both how well the reward model aligns with human preferences and the dynamics of the learning signal it provides to the policy. The results highlight three key findings on how to construct a reliable benchmark: (i) it is important to minimize differences between chosen and rejected responses beyond correctness, (ii) evaluating reward models requires multiple comparisons across a wide range of chosen and rejected responses, and (iii) given that reward models encounter responses with diverse representations, responses should be sourced from a variety of models. However, we also observe that an extremely high correlation with the degree of overoptimization leads to a comparatively lower correlation with certain downstream performance. Thus, when designing a benchmark, it is desirable to use the degree of overoptimization as a useful tool, rather than the end goal.

摘要

奖励模型（RMs）在基于人类反馈的强化学习（RLHF）中起着关键作用，其功能是将模型行为与人类偏好对齐。然而，现有奖励模型基准测试与优化策略性能之间的相关性较弱，这表明这些基准无法准确评估奖励模型的真实能力。为弥补这一差距，我们通过奖励过优化现象（该现象同时反映了奖励模型与人类偏好的对齐程度及其为策略提供的学习信号动态）探索了多种评估设计方案。研究结果揭示了构建可靠基准的三个关键发现：（i）必须最小化选定与拒绝响应之间除正确性之外的差异；（ii）评估奖励模型需要在广泛范围的选定与拒绝响应中进行多重比较；（iii）鉴于奖励模型会处理具有多样化表征的响应，响应数据应来源于多种模型。但我们也发现，与过优化程度的高度相关性会导致与某些下游性能的相关性相对降低。因此，在设计基准时，应将过优化程度作为有用工具而非终极目标来使用。

Shadow-FT: Tuning Instruct via Base

Abstract

arXiv:2505.12716v1 Announce Type: cross Abstract: Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. However, we observe that directly tuning the INSTRUCT (i.e., instruction-tuned) models often leads to marginal improvements and even performance degeneration. Notably, the paired BASE models, the foundation for these INSTRUCT variants, have highly similar weight values (i.e., differing by less than 2% on average for Llama 3.1 8B). Therefore, we propose a novel Shadow-FT framework to tune the INSTRUCT models by leveraging the corresponding BASE models. The key insight is to fine-tune the BASE model, and then directly graft the learned weight updates to the INSTRUCT model. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance. We conduct extensive experiments on tuning mainstream LLMs, such as the Qwen 3 and Llama 3 series, and evaluate them across 19 benchmarks covering coding, reasoning, and mathematical tasks. Experimental results demonstrate that Shadow-FT consistently outperforms conventional full-parameter and parameter-efficient tuning approaches. Further analyses indicate that Shadow-FT can be applied to multimodal large language models (MLLMs) and combined with direct preference optimization (DPO). Code and weights are available at https://github.com/wutaiqiang/Shadow-FT.

摘要

大型语言模型（LLMs）通常能通过针对不同任务的进一步微调获得性能提升。然而，我们观察到直接对INSTRUCT（即指令微调）模型进行微调往往仅带来边际改进甚至导致性能退化。值得注意的是，作为这些INSTRUCT变体基础的配对BASE模型，其权重值具有高度相似性（例如Llama 3.1 8B模型平均差异小于2%）。为此，我们提出新型Shadow-FT框架，通过利用对应BASE模型来微调INSTRUCT模型。该方法的核心理念是先微调BASE模型，然后将学习到的权重更新直接移植到INSTRUCT模型。我们提出的Shadow-FT无需引入额外参数，实现简便且能显著提升性能。我们在Qwen 3和Llama 3等主流LLMs上进行了大量实验，并在涵盖编码、推理和数学任务的19个基准测试中开展评估。实验结果表明，Shadow-FT始终优于传统的全参数和参数高效微调方法。进一步分析表明，Shadow-FT可应用于多模态大语言模型（MLLMs）并与直接偏好优化（DPO）相结合。代码及权重文件详见 https://github.com/wutaiqiang/Shadow-FT 。
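
The grafting step described in this abstract (fine-tune BASE, then add the learned weight update to INSTRUCT) can be illustrated with a minimal NumPy sketch. The function name `shadow_ft_graft`, the dictionary-of-arrays weight layout, and the toy values are assumptions made for illustration; they are not taken from the authors' released code.

```python
import numpy as np

def shadow_ft_graft(base, tuned_base, instruct):
    """Add the update learned on BASE to the INSTRUCT weights (hypothetical helper)."""
    grafted = {}
    for name, w_instruct in instruct.items():
        delta = tuned_base[name] - base[name]  # update learned by fine-tuning BASE
        grafted[name] = w_instruct + delta     # graft the same update onto INSTRUCT
    return grafted

# Toy weights: INSTRUCT differs only slightly from BASE, as the abstract notes.
rng = np.random.default_rng(0)
base = {"layer.weight": rng.normal(size=(2, 3))}
instruct = {"layer.weight": base["layer.weight"] + 0.01 * rng.normal(size=(2, 3))}
# Pretend fine-tuning shifted every BASE weight by 0.1.
tuned_base = {"layer.weight": base["layer.weight"] + 0.1}

grafted = shadow_ft_graft(base, tuned_base, instruct)
```

Because the update is a plain difference of weight tensors, grafting adds no parameters, which matches the abstract's claim; how the BASE model is fine-tuned (full-parameter or parameter-efficient) is left outside this sketch.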

SynDec: A Synthesize-then-Decode Approach for Arbitrary Textual Style Transfer via Large Language Models

Abstract

arXiv:2505.12821v1 Announce Type: cross Abstract: Large Language Models (LLMs) are emerging as dominant forces for textual style transfer (TST). However, for arbitrary style transfer, LLMs face two key challenges: (1) considerable reliance on manually constructed prompts and (2) rigid stylistic biases inherent in LLMs. In this paper, we propose a novel Synthesize-then-Decode (SynDec) approach, which automatically synthesizes high-quality prompts and amplifies their roles during the decoding process. Specifically, our approach synthesizes prompts by selecting representative few-shot samples, conducting a four-dimensional style analysis, and reranking the candidates. At the LLM decoding stage, the TST effect is amplified by maximizing the contrast in output probabilities between scenarios with and without the synthesized prompt, as well as between prompts and negative samples. We conduct extensive experiments, and the results show that SynDec outperforms existing state-of-the-art LLM-based methods on five out of six benchmarks (e.g., achieving up to a 9% increase in accuracy for modern-to-Elizabethan English transfer). Detailed ablation studies further validate the effectiveness of SynDec.

摘要

大语言模型（LLMs）正逐渐成为文本风格转换的主导力量。然而，在任意风格转换任务中，LLMs面临两个关键挑战：（1）严重依赖人工构建的提示词；（2）模型内部固有的刚性风格偏差。本文提出一种创新的"合成-解码"（SynDec）方法，能自动生成高质量提示词并在解码阶段增强其作用。具体而言，该方法通过选择代表性少样本、进行四维风格分析及候选重排来合成提示词。在LLM解码阶段，通过最大化"使用合成提示词"与"无提示词"场景的输出概率差异，以及提示词与负样本间的对比度，从而放大风格转换效果。大量实验表明，SynDec在六项基准测试中有五项超越现有最先进的基于LLM的方法（例如在现代英语到伊丽莎白时代英语转换任务中准确率最高提升9%）。详尽的消融研究进一步验证了SynDec的有效性。
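
The idea of amplifying a prompt's effect by contrasting next-token distributions with and without it can be sketched in a few lines of Python. The scoring rule below, (1 + alpha) * log p_with - alpha * log p_without, is a common contrastive-decoding form and is an assumption here; the abstract does not state SynDec's exact formula, and the contrast against negative samples is omitted. All names and logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def contrastive_next_token(logits_with_prompt, logits_without_prompt, alpha=1.0):
    """Pick the token whose log-probability is boosted most by the prompt.

    alpha = 0 reduces to ordinary greedy decoding with the prompt.
    """
    lp_with = log_softmax(logits_with_prompt)
    lp_without = log_softmax(logits_without_prompt)
    scores = [(1 + alpha) * a - alpha * b for a, b in zip(lp_with, lp_without)]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vocabulary: index 0 = "you", 1 = "thou", 2 = "the" (made-up logits).
with_prompt = [2.0, 1.8, 0.0]     # the style prompt raises "thou", but "you" still leads
without_prompt = [2.0, 0.0, 0.0]  # without the prompt, "thou" is unlikely

greedy_choice = contrastive_next_token(with_prompt, without_prompt, alpha=0.0)
contrastive_choice = contrastive_next_token(with_prompt, without_prompt, alpha=1.0)
```

In this toy case greedy decoding keeps the model's default word, while the contrastive score selects the token whose probability rose most when the prompt was added, which is the sense in which the prompt's role is amplified.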

PsyMem: Fine-grained psychological alignment and Explicit Memory Control for Advanced Role-Playing LLMs

Abstract

arXiv:2505.12814v1 Announce Type: cross Abstract: Existing LLM-based role-playing methods often rely on superficial textual descriptions or simplistic metrics, inadequately modeling both intrinsic and extrinsic character dimensions. Additionally, they typically simulate character memory with implicit model knowledge or basic retrieval-augmented generation without explicit memory alignment, compromising memory consistency. These two issues weaken the reliability of role-playing LLMs in several applications, such as trustworthy social simulation. To address these limitations, we propose PsyMem, a novel framework integrating fine-grained psychological attributes and explicit memory control for role-playing. PsyMem supplements textual descriptions with 26 psychological indicators to model characters in detail. Additionally, PsyMem implements memory alignment training, which explicitly trains the model to align a character's responses with memory, thereby enabling dynamic memory-controlled responding during inference. By training Qwen2.5-7B-Instruct on our specially designed dataset (including 5,414 characters and 38,962 dialogues extracted from novels), the resulting model, termed PsyMem-Qwen, outperforms baseline models in role-playing, achieving the best performance in human-likeness and character fidelity.

摘要

现有基于大语言模型（LLM）的角色扮演方法通常依赖浅层的文本描述或简单指标，未能充分建模角色的内在与外在维度。此外，这些方法通常通过隐式模型知识或基础检索增强生成来模拟角色记忆，缺乏显式的记忆对齐机制，导致记忆一致性受损。这两个问题削弱了角色扮演LLM在可信社交模拟等应用中的可靠性。为解决这些局限，我们提出PsyMem框架——一种整合细粒度心理属性与显式记忆控制的新型角色扮演方案。PsyMem通过26项心理指标补充文本描述，实现角色精细建模；同时采用记忆对齐训练，显式指导模型将角色响应与记忆对齐，从而在推理阶段实现动态记忆控制响应。通过在专门构建的数据集（包含从小说中提取的5,414个角色和38,962段对话）上训练Qwen2.5-7B-Instruct模型，所得PsyMem-Qwen在角色扮演任务中超越基线模型，在拟人化程度与角色保真度方面达到最佳表现。

A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Abstract

arXiv:2505.12781v1 Announce Type: cross Abstract: Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens while using only 20B tokens, achieving over 1,000x training efficiency. Our code and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.

摘要

训练高性能小型语言模型（SLMs）的成本仍然高昂，即使通过知识蒸馏和从大型教师模型中进行剪枝。现有工作常面临三个关键挑战：（1）硬剪枝导致的信息损失，（2）表征对齐效率低下，（3）信息性激活（尤其是前馈网络FFN）的利用不足。为解决这些问题，我们提出低秩克隆（LRC）——一种高效的预训练方法，通过构建行为上等效于强教师模型的SLMs。LRC训练一组低秩投影矩阵，通过压缩教师权重实现软剪枝，并通过将学生激活（包括FFN信号）与教师对齐实现激活克隆。这种统一设计在最大化知识迁移的同时，消除了显式对齐模块的需求。基于开源教师模型（如Llama-3.2-3B-Instruct、Qwen2.5-3B/7B-Instruct）的大量实验表明，LRC在仅使用200亿标记的情况下，性能匹配或超越基于数万亿标记训练的最先进模型，实现超过1000倍的训练效率。代码与模型检查点详见https://github.com/CURRENTF/LowRankClone 和 https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf。

Bias Fitting to Mitigate Length Bias of Reward Model in RLHF

Abstract

arXiv:2505.12843v1 Announce Type: cross Abstract: Reinforcement Learning from Human Feedback (RLHF) relies on reward models to align large language models with human preferences. However, RLHF often suffers from reward hacking, wherein policy learning exploits flaws in the trained reward model to maximize reward scores without genuinely aligning with human preferences. A significant example of such reward hacking is length bias, where reward models usually favor longer responses irrespective of actual response quality. Previous works on length bias have notable limitations: these approaches either mitigate bias without characterizing the bias form, or simply assume a linear length-reward relation. To accurately model the intricate nature of length bias and facilitate more effective bias mitigation, we propose FiMi-RM (Bias Fitting to Mitigate Length Bias of Reward Model in RLHF), a framework that autonomously learns and corrects underlying bias patterns. Our approach consists of three stages: First, we train a standard reward model, which inherently contains length bias. Next, we deploy a lightweight fitting model to explicitly capture the non-linear relation between length and reward. Finally, we incorporate this learned relation into the reward model to debias it. Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution. Furthermore, when applied to alignment algorithms, our debiased reward model improves length-controlled win rate and reduces verbosity without compromising its performance.

摘要

基于人类反馈的强化学习依赖奖励模型来实现大语言模型与人类偏好的对齐。然而该方法常面临奖励破解问题，即策略学习通过利用训练奖励模型的缺陷来最大化奖励分数，而非真正实现人类偏好对齐。长度偏差是该问题的典型表现，即奖励模型倾向于给长文本更高评分而忽视实际质量。现有长度偏差研究存在明显局限：这些方法要么未明确偏差形式就进行缓解，要么简单假设长度与奖励呈线性关系。为精确建模长度偏差的复杂特性并实现更有效的偏差缓解，我们提出FiMi-RM框架（通过偏差拟合缓解RLHF中奖励模型的长度偏差），该框架能自主学习并修正潜在偏差模式。我们的方法包含三个阶段：首先训练含有固有长度偏差的标准奖励模型；随后部署轻量级拟合模型显式捕捉长度与奖励间的非线性关系；最后将学习到的关系整合至奖励模型以实现去偏。实验结果表明FiMi-RM能获得更均衡的长度-奖励分布。此外，当应用于对齐算法时，经过去偏的奖励模型在保持性能的同时，提高了长度控制胜率并降低了冗余度。

FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA

Abstract

arXiv:2505.12805v1 Announce Type: cross Abstract: Low-Rank Adaptation (LoRA), which introduces a product of two trainable low-rank matrices into frozen pre-trained weights, is widely used for efficient fine-tuning of language models in federated learning (FL). However, when combined with differentially private stochastic gradient descent (DP-SGD), LoRA faces substantial noise amplification: DP-SGD perturbs per-sample gradients, and the matrix multiplication of the LoRA update ( $BA$ ) intensifies this effect. Freezing one matrix (e.g., $A$ ) reduces the noise but restricts model expressiveness, often resulting in suboptimal adaptation. To address this, we propose FedSVD, a simple yet effective method that introduces a global reparameterization based on singular value decomposition (SVD). In our approach, each client optimizes only the $B$ matrix and transmits it to the server. The server aggregates the $B$ matrices, computes the product $BA$ using the previous $A$ , and refactorizes the result via SVD. This yields a new adaptive $A$ composed of the orthonormal right singular vectors of $BA$ , and an updated $B$ containing the remaining SVD components. This reparameterization avoids quadratic noise amplification, while allowing $A$ to better capture the principal directions of the aggregate updates. Moreover, the orthonormal structure of $A$ bounds the gradient norms of $B$ and preserves more signal under DP-SGD, as confirmed by our theoretical analysis. As a result, FedSVD consistently improves stability and performance across a variety of privacy settings and benchmarks, outperforming relevant baselines under both private and non-private regimes.

摘要

低秩自适应（LoRA）通过引入两个可训练低秩矩阵的乘积到冻结的预训练权重中，被广泛用于联邦学习（FL）中语言模型的高效微调。然而，当与差分隐私随机梯度下降（DP-SGD）结合时，LoRA面临显著的噪声放大问题：DP-SGD扰动每个样本的梯度，而LoRA更新（ $BA$ ）的矩阵乘法加剧了这一效应。冻结其中一个矩阵（如 $A$ ）可减少噪声，但会限制模型表达能力，通常导致次优的自适应效果。为解决这一问题，我们提出FedSVD，这是一种简单而有效的方法，引入基于奇异值分解（SVD）的全局重参数化。在我们的方法中，每个客户端仅优化 $B$ 矩阵并将其传输至服务器。服务器聚合所有 $B$ 矩阵，使用先前的 $A$ 计算乘积 $BA$ ，并通过SVD对结果进行重构。这产生了一个由 $BA$ 的正交右奇异向量组成的新自适应 $A$ ，以及一个包含剩余SVD分量的更新后的 $B$ 。这种重参数化避免了二次噪声放大，同时使 $A$ 能更好地捕捉聚合更新的主方向。此外， $A$ 的正交结构限制了 $B$ 的梯度范数，并在DP-SGD下保留了更多信号，这一点已通过我们的理论分析得到证实。因此，FedSVD在各种隐私设置和基准测试中持续提升了稳定性和性能，在隐私和非隐私机制下均优于相关基线方法。
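
The server-side step described in this abstract (aggregate the clients' $B$ matrices, form $BA$ with the previous $A$, then refactorize via SVD) can be sketched with NumPy. Plain averaging as the aggregation rule, the function name `fedsvd_server_step`, and the toy shapes are assumptions for illustration; the abstract does not specify them, and the DP-SGD client updates are not modeled.

```python
import numpy as np

def fedsvd_server_step(client_Bs, A_prev):
    """One hypothetical server round: aggregate B, then refactorize BA via SVD."""
    r = A_prev.shape[0]                   # LoRA rank
    B_avg = np.mean(client_Bs, axis=0)    # assumed aggregation: simple average
    M = B_avg @ A_prev                    # aggregated low-rank update BA
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    A_new = Vt[:r]                        # orthonormal right singular vectors
    B_new = U[:, :r] * S[:r]              # remaining SVD components (U times Sigma)
    return B_new, A_new

# Toy shapes: output dim 6, input dim 5, rank 2, three clients.
rng = np.random.default_rng(0)
A_prev = rng.normal(size=(2, 5))
client_Bs = [rng.normal(size=(6, 2)) for _ in range(3)]

M = np.mean(client_Bs, axis=0) @ A_prev
B_new, A_new = fedsvd_server_step(client_Bs, A_prev)
```

Since $BA$ has rank at most $r$, keeping the top $r$ singular triplets reproduces the aggregated update exactly, while the rows of the new $A$ are orthonormal, the property the abstract credits with bounding the gradient norms of $B$.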

LEXam: Benchmarking Legal Reasoning on 340 Law Exams

Abstract

arXiv:2505.12864v1 Announce Type: cross Abstract: Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach, such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/

摘要

尽管近期测试时扩展技术取得进展，长篇幅法律推理仍是大型语言模型（LLMs）面临的核心挑战。本文提出LEXam基准测试集，该数据集源自116门法学院课程中的340份法律考试，涵盖多学科与不同学位层级。数据集包含4,886道英文和德文法律试题，其中2,841道为开放式问答题，2,045道为选择题。除参考答案外，开放题还附有明确指引，阐明预期的法律推理方法，如争议点识别、规则回溯或规则适用。我们对开放式与选择题的评估表明，当前LLMs面临重大挑战——尤其在需要结构化多步骤法律推理的开放题上表现显著不足。此外，实验结果凸显了该数据集在区分不同能力模型方面的有效性。通过采用"LLM即评判者"范式并辅以严格的人类专家验证，我们证明了模型生成推理步骤可被一致且准确地评估。该评估框架提供了超越简单准确率指标的法律推理质量规模化测评方法。项目页面：https://lexam-benchmark.github.io/

Does Low Rank Adaptation Lead to Lower Robustness against Training-Time Attacks?

Abstract

arXiv:2505.12871v1 Announce Type: cross Abstract: Low-rank adaptation (LoRA) has emerged as a prominent technique for fine-tuning large language models (LLMs) thanks to its superb efficiency gains over previous methods. While extensive studies have examined the performance and structural properties of LoRA, its behavior under training-time attacks remains underexplored, posing significant security risks. In this paper, we theoretically investigate the security implications of LoRA's low-rank structure during fine-tuning, in the context of its robustness against data poisoning and backdoor attacks. We propose an analytical framework that models LoRA's training dynamics, employs the neural tangent kernel to simplify the analysis of the training process, and applies information theory to establish connections between LoRA's low-rank structure and its vulnerability to training-time attacks. Our analysis indicates that LoRA exhibits better robustness to backdoor attacks than full fine-tuning, while becoming more vulnerable to untargeted data poisoning due to its over-simplified information geometry. Extensive experimental evaluations corroborate our theoretical findings.

摘要

低秩自适应（LoRA）因其相较于先前方法显著的效率优势，已成为微调大语言模型（LLM）的重要技术。尽管已有大量研究探讨了LoRA的性能与结构特性，但其在训练时遭受攻击的行为仍缺乏深入探究，这带来了重大安全风险。本文从数据投毒和后门攻击的鲁棒性角度，理论研究了LoRA低秩结构在微调过程中的安全影响。我们提出一个分析框架：通过建模LoRA的训练动态，利用神经正切核简化训练过程分析，并应用信息论建立LoRA低秩结构与攻击脆弱性之间的关联。分析表明，LoRA相较于全参数微调对后门攻击具有更强鲁棒性，但由于其过度简化的信息几何结构，对无目标数据投毒更为敏感。大量实验验证了我们的理论发现。

The Hidden Structure -- Improving Legal Document Understanding Through Explicit Text Formatting

Abstract

arXiv:2505.12837v1 Announce Type: cross Abstract: Legal contracts possess an inherent, semantically vital structure (e.g., sections, clauses) that is crucial for human comprehension but whose impact on LLM processing remains under-explored. This paper investigates the effects of explicit input text structure and prompt engineering on the performance of GPT-4o and GPT-4.1 on a legal question-answering task using an excerpt of the CUAD. We compare model exact-match accuracy across various input formats: well-structured plain text (human-generated from CUAD), plain text cleaned of line breaks, plain text extracted by Azure OCR, plain text extracted by GPT-4o Vision, and Markdown (MD) extracted (and interpreted) by GPT-4o Vision. To indicate the possible impact of prompt engineering, we assess the effect of shifting task instructions to the system prompt and explicitly informing the model about the structured nature of the input. Our findings reveal that GPT-4o demonstrates considerable robustness to variations in input structure, but falls short in overall performance. Conversely, GPT-4.1's performance is markedly sensitive; poorly structured inputs yield suboptimal results (though identical to those of GPT-4o), while well-structured formats (original CUAD text, GPT-4o Vision text and GPT-4o MD) improve exact-match accuracy by ~20 percentage points. Optimizing the system prompt to include task details and an advisory about structured input further elevates GPT-4.1's accuracy by an additional ~10-13 percentage points, with Markdown ultimately achieving the highest performance under these conditions (79% overall exact-match accuracy). This research empirically demonstrates that while newer models exhibit greater resilience, careful input structuring and strategic prompt design remain critical for optimizing the performance of LLMs, and can significantly affect outcomes in high-stakes legal applications.

摘要

法律合同具有固有的、语义上至关重要的结构（如章节、条款），这种结构对人类理解至关重要，但其对大型语言模型（LLM）处理的影响尚未得到充分探索。本文通过使用CUAD摘录的法律问答任务，研究了显式输入文本结构和提示工程对GPT-4o和GPT-4.1性能的影响。我们比较了不同输入格式下的模型精确匹配准确率：结构良好的纯文本（从CUAD人工生成）、去除换行符的纯文本、Azure OCR提取的纯文本、GPT-4o Vision提取的纯文本，以及GPT-4o Vision提取（并解释）的Markdown（MD）格式。为评估提示工程的潜在影响，我们分析了将任务指令移至系统提示以及明确告知模型输入结构化特性的效果。研究发现，GPT-4o对输入结构变化表现出较强的鲁棒性，但整体性能不足；而GPT-4.1的性能则显著敏感——结构不良的输入会导致次优结果（与GPT-4o相同），而结构良好的格式（原始CUAD文本、GPT-4o Vision文本和GPT-4o MD）可将精确匹配准确率提高约20个百分点。通过优化系统提示以包含任务细节和结构化输入建议，GPT-4.1的准确率可再提升约10-13个百分点，其中Markdown格式在这些条件下最终达到最高性能（总体精确匹配准确率为79个百分点）。本研究实证表明，尽管新模型展现出更强的适应性，但精细的输入结构设计和策略性提示工程仍是优化LLM性能的关键，并可能对高风险法律应用的结果产生重大影响。

AutoGEEval: A Multimodal and Automated Framework for Geospatial Code Generation on GEE with Large Language Models

Abstract

arXiv:2505.12900v1 Announce Type: cross Abstract: Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline, from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs (including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models), revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.

摘要

地理空间代码生成正逐渐成为人工智能与地学分析融合的关键方向。然而，该领域目前仍缺乏标准化的自动评估工具。为此，我们提出了AutoGEEval——首个基于大语言模型、面向Google Earth Engine（GEE）平台地理空间代码生成任务的多模态单元级自动化评估框架。该框架以GEE Python API为基础，构建了包含26种GEE数据类型、1325个测试案例的基准测试集（AutoGEEval-Bench）。通过集成问题生成与答案验证组件，该框架实现了从函数调用到执行验证的端到端自动化评估流程。AutoGEEval支持从准确性、资源消耗、执行效率和错误类型等多维度对模型输出进行定量分析。我们对18个前沿大语言模型（包括通用型、推理增强型、代码专用型及地学专用型）进行了评估，揭示了其在GEE代码生成中的性能特征与潜在优化路径。本研究为地理空间代码生成模型的开发与评估提供了统一协议和基础资源，推动了自然语言到领域专用代码自动翻译的前沿发展。

Sinusoidal Initialization, Time for a New Start

Abstract

arXiv:2505.12909v1 Announce Type: cross Abstract: Initialization plays a critical role in Deep Neural Network training, directly influencing convergence, stability, and generalization. Common approaches such as Glorot and He initializations rely on randomness, which can produce uneven weight distributions across layer connections. In this paper, we introduce the Sinusoidal initialization, a novel deterministic method that employs sinusoidal functions to construct structured weight matrices expressly to improve the spread and balance of weights throughout the network while simultaneously fostering a more uniform, well-conditioned distribution of neuron activation states from the very first forward pass. Because Sinusoidal initialization begins with weights and activations that are already evenly and efficiently utilized, it delivers consistently faster convergence, greater training stability, and higher final accuracy across a wide range of models, including convolutional neural networks, vision transformers, and large language models. On average, our experiments show an increase of 4.8% in final validation accuracy and 20.9% in convergence speed. By replacing randomness with structure, this initialization provides a stronger and more reliable foundation for Deep Learning systems.

摘要

初始化在深度神经网络训练中起着关键作用，直接影响模型的收敛性、稳定性和泛化能力。当前常用的Glorot和He初始化等方法依赖于随机性，可能导致各层连接间的权重分布不均。本文提出正弦初始化（Sinusoidal initialization）这一新颖的确定性方法，该方法利用正弦函数构建结构化权重矩阵，旨在改善网络整体权重的分布广度与平衡性，同时促进神经元激活状态从首次前向传播开始就形成更均匀、良态化的分布。由于正弦初始化使权重和激活从一开始就得到均衡高效利用，该方法在包括卷积神经网络、视觉变换器和大型语言模型在内的多种模型中，均展现出更快的收敛速度、更强的训练稳定性以及更高的最终准确率。实验表明，该方法平均可提升4.8%的最终验证准确率和20.9%的收敛速度。通过用结构化设计替代随机性，该初始化方法为深度学习系统提供了更强大可靠的训练基础。

Leveraging LLM Inconsistency to Boost Pass@k Performance

Abstract

arXiv:2505.12938v1 Announce Type: cross Abstract: Large language models (LLMs) achieve impressive abilities in numerous domains, but exhibit inconsistent performance in response to minor input changes. Rather than view this as a drawback, in this paper we introduce a novel method for leveraging models' inconsistency to boost Pass@k performance. Specifically, we present a "Variator" agent that generates k variants of a given task and submits one candidate solution for each one. Our variant generation approach is applicable to a wide range of domains as it is task agnostic and compatible with free-form inputs. We demonstrate the efficacy of our agent theoretically using a probabilistic model of the inconsistency effect, and show empirically that it outperforms the baseline on the APPS dataset. Furthermore, we establish that inconsistency persists even in frontier reasoning models across coding and cybersecurity domains, suggesting our method is likely to remain relevant for future model generations.

摘要

大语言模型（LLMs）在众多领域展现出卓越能力，但对细微输入变化的响应表现存在不一致性。本文并未将此视为缺陷，而是提出了一种利用模型不一致性来提升Pass@k性能的新方法。具体而言，我们设计了一个"变体生成器"代理，可为给定任务生成k个变体，并为每个变体提交一个候选解决方案。该变体生成方法具有任务无关性且兼容自由格式输入，适用于广泛领域。我们通过建立不一致性效应的概率模型从理论上验证了代理的有效性，并在APPS数据集上实证表明其性能优于基线方法。此外，我们发现前沿推理模型在编程和网络安全领域仍存在不一致性，这表明我们的方法对未来模型迭代仍具有适用价值。
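
A toy calculation shows why inconsistency across task variants can raise Pass@k. Assuming independent attempts, Pass@k is one minus the product of the per-attempt failure probabilities, and by the AM-GM inequality that product is largest when all attempts share the same success probability. The probabilities below are made up, and this simple model is an illustrative assumption, not necessarily the probabilistic model used in the paper.

```python
import math

def pass_at_k(success_probs):
    """Probability that at least one of several independent attempts succeeds."""
    return 1.0 - math.prod(1.0 - p for p in success_probs)

k = 4
# Baseline: k samples on the original task, each succeeding with probability 0.25.
same_task = [0.25] * k
# Variator-style: one attempt per task variant; the mean success probability is
# the same (0.25), but the model is inconsistent across variants.
variants = [0.05, 0.10, 0.25, 0.60]

baseline_pass = pass_at_k(same_task)
variant_pass = pass_at_k(variants)
```

With the same average per-attempt success rate, spreading attempts over variants with unequal success probabilities gives a strictly higher chance that at least one attempt passes; the gain vanishes only when the model is perfectly consistent.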

Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs

Abstract

arXiv:2505.12929v1 Announce Type: cross Abstract: Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.

摘要

强化学习（RL）已成为提升大语言模型（LLM）推理能力的关键技术，近期提出的组相对策略优化（GRPO）等方法展现出卓越成效。本研究揭示了一个长期被忽视的RL训练问题：低概率token因其梯度幅值过大，会不成比例地主导模型更新。这种主导效应阻碍了高概率token的有效学习——尽管后者对LLM性能至关重要，但其梯度被严重抑制。为缓解这一干扰，我们提出两种创新方法：优势重加权（Advantage Reweighting）与低概率token隔离（Lopti），二者能有效衰减低概率token的梯度，同时强化由高概率token驱动的参数更新。我们的方法促进了不同概率token间的平衡更新，从而提升RL训练效率。实验结果表明，这些方法显著改善了GRPO训练的LLM性能，在K&K逻辑谜题推理任务中最高实现46.2%的性能提升。代码实现详见https://github.com/zhyang2226/AR-Lopti。
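
The observation that low-probability tokens carry large gradients can be illustrated at the level of logits. The sketch below is a simplification, not the paper's analysis or either of its proposed methods: it only uses the standard identity that the gradient of log π(token) with respect to the softmax logits equals one-hot(token) minus π, and the logit values are hypothetical.

```python
from math import exp, sqrt


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob_grad_norm(probs: list[float], token: int) -> float:
    """L2 norm of d log pi(token) / d logits, which equals onehot(token) - probs."""
    grad = [(1.0 if j == token else 0.0) - p for j, p in enumerate(probs)]
    return sqrt(sum(g * g for g in grad))


# Hypothetical logits: token 0 is very likely, token 3 is rare.
probs = softmax([4.0, 1.0, 0.0, -1.0])
norm_likely = log_prob_grad_norm(probs, 0)
norm_rare = log_prob_grad_norm(probs, 3)

print(f"p(likely)={probs[0]:.3f}, grad norm={norm_likely:.3f}")
print(f"p(rare)={probs[3]:.3f}, grad norm={norm_rare:.3f}")
```

For the same advantage, the rare token therefore pushes the logits much harder than the likely one, which is the imbalance the abstract describes.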

DGRO: Enhancing LLM Reasoning via Exploration-Exploitation Control and Reward Variance Management

Abstract

arXiv:2505.12951v1 Announce Type: cross Abstract: Inference scaling further accelerates Large Language Models (LLMs) toward Artificial General Intelligence (AGI), with large-scale Reinforcement Learning (RL) to unleash long Chain-of-Thought reasoning. Most contemporary reasoning approaches usually rely on handcrafted rule-based reward functions. However, the trade-offs between exploration and exploitation in RL algorithms involve multiple complex considerations, and the theoretical and empirical impacts of manually designed reward functions remain insufficiently explored. In this paper, we propose Decoupled Group Reward Optimization (DGRO), a general RL algorithm for LLM reasoning. On the one hand, DGRO decouples the traditional regularization coefficient into two independent hyperparameters: one scales the policy gradient term, and the other regulates the distance from the sampling policy. This decoupling not only enables precise control over balancing exploration and exploitation, but also can be seamlessly extended to Online Policy Mirror Descent (OPMD) algorithms in Kimi k1.5 and Direct Reward Optimization. On the other hand, we observe that reward variance significantly affects both convergence speed and final model performance. We conduct both theoretical analysis and extensive empirical validation to assess DGRO, including a detailed ablation study that investigates its performance and optimization dynamics. Experimental results show that DGRO achieves state-of-the-art performance on the Logic dataset with an average accuracy of 96.9%, and demonstrates strong generalization across mathematical benchmarks.

摘要

推理规模化进一步加速了大语言模型（LLMs）向人工通用智能（AGI）的发展，通过大规模强化学习（RL）释放长链思维推理能力。当前多数推理方法通常依赖于手工设计的基于规则的奖励函数。然而，RL算法中探索与利用的权衡涉及多重复杂考量，且人工设计奖励函数的理论与实证影响尚未得到充分研究。本文提出解耦分组奖励优化（DGRO），一种适用于LLM推理的通用RL算法。一方面，DGRO将传统正则化系数解耦为两个独立超参数：一个缩放策略梯度项，另一个调节采样策略的距离。这种解耦不仅能精确控制探索与利用的平衡，还可无缝扩展至Kimi k1.5中的在线策略镜像下降（OPMD）算法和直接奖励优化。另一方面，我们发现奖励方差显著影响收敛速度和最终模型性能。通过理论分析和大量实证验证（包括探究其性能与优化动态的详细消融实验）评估DGRO。实验结果表明，DGRO在Logic数据集上以96.9%的平均准确率取得最先进性能，并在数学基准测试中展现出强大的泛化能力。

CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming

Abstract

arXiv:2505.12925v1 Announce Type: cross Abstract: Competitive programming benchmarks are widely used in scenarios such as programming contests and large language model assessments. However, the growing presence of duplicate or highly similar problems raises concerns not only about competition fairness, but also about the validity of competitive programming as a benchmark for model evaluation. In this paper, we propose a new problem -- similar question retrieval -- to address this issue. Due to the lack of both data and models, solving this problem is challenging. To this end, we introduce CPRet, a retrieval-oriented benchmark suite for competitive programming, covering four retrieval tasks: two code-centric (i.e., Text-to-Code and Code-to-Code) and two newly proposed problem-centric tasks (i.e., Problem-to-Duplicate and Simplified-to-Full), built from a combination of automatically crawled problem-solution data and manually curated annotations. Our contribution includes both high-quality training data and temporally separated test sets for reliable evaluation. In addition, we develop two task-specialized retrievers based on this dataset: CPRetriever-Code, trained with a novel Group-InfoNCE loss for problem-code alignment, and CPRetriever-Prob, fine-tuned for identifying problem-level similarity. Both models achieve strong results and are open-sourced for local use. Finally, we analyze LiveCodeBench and find that high-similarity problems inflate model pass rates and reduce differentiation, underscoring the need for similarity-aware evaluation in future benchmarks. Code and data are available at: https://github.com/coldchair/CPRet

摘要

竞争性编程基准在编程竞赛和大语言模型评估等场景中被广泛使用。然而，重复或高度相似题目数量的不断增加，不仅引发了关于竞赛公平性的担忧，也对竞争性编程作为模型评估基准的有效性提出了质疑。本文提出通过解决"相似题目检索"这一新问题来应对该挑战。由于缺乏数据和模型支持，该问题的解决存在较大难度。为此，我们构建了CPRet——一个面向检索任务的竞争性编程基准套件，涵盖四项检索任务：两项代码中心任务（文本到代码和代码到代码）以及两项新提出的问题中心任务（问题到重复题和简化题到原题），该套件基于自动爬取的问题-解决方案数据和人工标注共同构建。我们的贡献包括高质量的训练数据以及时间分隔的测试集以确保可靠评估。此外，基于该数据集开发了两个专用检索模型：采用新型Group-InfoNCE损失进行问题-代码对齐训练的CPRetriever-Code，以及针对问题级相似度识别微调的CPRetriever-Prob。两个模型均取得优异性能并已开源供本地使用。最后，通过对LiveCodeBench的分析发现，高相似度题目会虚增模型通过率并降低区分度，这凸显了未来基准测试中引入相似度感知评估的必要性。代码与数据详见：https://github.com/coldchair/CPRet
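
For readers unfamiliar with contrastive retrieval objectives, the sketch below shows the standard InfoNCE loss for one query with a single positive and several negatives. It is not the paper's Group-InfoNCE, whose grouping scheme the abstract does not describe; the similarity scores and the temperature value are hypothetical.

```python
from math import exp, log


def info_nce(pos_score: float, neg_scores: list[float], temperature: float = 0.1) -> float:
    """Standard InfoNCE: negative log-softmax of the positive pair's score
    among the positive and all negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + log(sum(exp(z - m) for z in logits))
    return log_denominator - logits[0]


# Hypothetical similarity scores between a problem and candidate solutions.
well_aligned = info_nce(0.9, [0.1, 0.0, -0.2])    # correct code scores highest
poorly_aligned = info_nce(0.1, [0.9, 0.0, -0.2])  # a wrong solution scores highest

print(f"well aligned: {well_aligned:.4f}, poorly aligned: {poorly_aligned:.4f}")
```

The loss is near zero when the matching solution clearly outscores the others and grows when a non-matching solution scores higher, which is what drives the retriever toward problem-code alignment.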

A3: an Analytical Low-Rank Approximation Framework for Attention

Abstract

arXiv:2505.12942v1 Announce Type: cross Abstract: Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) they focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches for decomposed small matrices. To address these limitations, we propose $\mathtt{A}^3$, a post-training low-rank approximation framework. $\mathtt{A}^3$ splits a Transformer layer into three functional components, namely $\mathtt{QK}$, $\mathtt{OV}$, and $\mathtt{MLP}$. For each component, $\mathtt{A}^3$ provides an analytical solution that reduces the hidden dimension size inside each component while minimizing the component's functional loss (i.e., error in attention scores, attention outputs, and MLP outputs). This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overhead. In addition, it provides a new narrative in advancing the optimization problem from single linear layer loss optimization toward improved end-to-end performance. Through extensive experiments, we show that $\mathtt{A}^3$ maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also demonstrate the versatility of $\mathtt{A}^3$, including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.

摘要

大型语言模型展现出卓越性能，但其庞大的参数量导致部署成本极高。低秩近似提供了一种有效的压缩方案，然而现有方法存在两大局限：(1) 这些方法仅关注最小化单个线性层的输出误差，未考虑Transformer架构特性；(2) 将大权重矩阵分解为两个小型低秩矩阵。这导致此类方法在压缩效果上常逊色于剪枝和量化等技术，并引入运行时开销（如分解后小矩阵所需的额外GEMM内核启动）。为突破这些限制，我们提出 $\mathtt{A}^3$ 训练后低秩近似框架。该框架将Transformer层划分为 $\mathtt{QK}$、$\mathtt{OV}$ 和 $\mathtt{MLP}$ 三个功能组件，针对每个组件提供解析解：在最小化组件功能损失（即注意力分数误差、注意力输出误差和MLP输出误差）的同时，缩减组件内部隐藏维度大小。该方法直接减小模型规模、KV缓存大小和浮点运算量，且不引入额外运行时开销。此外，它将优化问题的研究视角从单一线性层损失优化推进到端到端性能提升。大量实验表明，$\mathtt{A}^3$ 的性能优于现有最优方法。例如，在相同的计算与内存压缩预算下，经低秩近似的LLaMA 3.1-70B模型在WikiText-2数据集上困惑度达4.69，较先前最优结果的7.87降低3.18。我们还验证了框架的多功能性，包括KV缓存压缩、量化及混合秩分配等性能增强应用。

An Empirical Study of Many-to-Many Summarization with Large Language Models

Abstract

arXiv:2505.12983v1 Announce Type: cross Abstract: Many-to-many summarization (M2MS) aims to process documents in any language and generate the corresponding summaries also in any language. Recently, large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study on LLMs' M2MS ability. Specifically, we first reorganize M2MS data based on eight previous domain-specific datasets. The reorganized data contains 47.8K samples spanning five domains and six languages, which could be used to train and evaluate LLMs. Then, we benchmark 18 LLMs in a zero-shot manner and an instruction-tuning manner. Fine-tuned traditional models (e.g., mBART) are also evaluated for comparison. Our experiments reveal that zero-shot LLMs achieve competitive results with fine-tuned traditional models. After instruction tuning, open-source LLMs can significantly improve their M2MS ability, and outperform zero-shot LLMs (including GPT-4) in terms of automatic evaluations. In addition, we demonstrate that this task-specific improvement does not sacrifice the LLMs' general task-solving abilities. However, as revealed by our human evaluation, LLMs still face the factuality issue, and the instruction tuning might intensify the issue. Thus, how to control factual errors becomes the key when building LLM summarizers in real applications, and is worth noting in future research.

摘要

多对多摘要生成（M2MS）旨在处理任意语言的文档并生成相应语言的摘要。近期，大语言模型（LLMs）展现出强大的多语言能力，使其具备在实际应用中执行M2MS任务的潜力。本研究对大语言模型的M2MS能力进行了系统性实证分析。具体而言，我们首先基于八个原有领域专用数据集重构了M2MS数据。重构后的数据集包含47.8K个样本，涵盖五个领域和六种语言，可用于大语言模型的训练与评估。随后，我们以零样本和指令微调两种方式对18个大语言模型进行基准测试，同时对比了经过微调的传统模型（如mBART）。实验表明：零样本大语言模型取得了与微调传统模型相当的结果；经过指令微调后，开源大语言模型的M2MS能力显著提升，在自动评估指标上优于零样本大语言模型（包括GPT-4）。此外，我们证实这种任务特异性提升不会削弱模型的通用任务解决能力。然而，人工评估揭示大语言模型仍存在事实性错误问题，且指令微调可能加剧该现象。因此，在实际应用中构建大语言模型摘要器时，如何控制事实错误成为关键问题，值得未来研究重点关注。

Fractured Chain-of-Thought Reasoning

Abstract

arXiv:2505.12992v1 Announce Type: cross Abstract: Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning.

摘要

推理阶段扩展技术通过在不重新训练的情况下利用额外的计算资源，显著增强了大型语言模型（LLM）的推理能力。类似地，思维链（CoT）提示及其扩展形式长思维链（Long CoT）通过生成丰富的中间推理轨迹来提高准确性，但这些方法会产生大量的令牌成本，阻碍了其在延迟敏感场景中的部署。在本研究中，我们首先表明，截断式思维链（truncated CoT）——即在推理完成前停止并直接生成最终答案——通常能够匹配完整思维链采样的效果，同时大幅减少令牌使用量。基于这一发现，我们提出了分段采样（Fractured Sampling），这是一种统一的推理阶段策略，可在完整思维链与仅生成解决方案的采样之间沿三个正交维度进行插值：（1）推理轨迹的数量，（2）每条轨迹的最终解决方案数量，以及（3）推理轨迹截断的深度。通过在五个多样化推理基准和多个模型规模上的大量实验，我们证明分段采样始终能够实现更优的准确性与成本权衡，在Pass@k与令牌预算的关系中呈现出陡峭的对数线性扩展增益。我们的分析揭示了如何在这些维度上分配计算资源以最大化性能，为更高效、可扩展的LLM推理铺平了道路。

From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents

Abstract

arXiv:2505.12981v1 Announce Type: cross Abstract: The growing adoption of large language models (LLMs) has led to a new paradigm in mobile computing--LLM-powered mobile AI agents--capable of decomposing and automating complex tasks directly on smartphones. However, the security implications of these agents remain largely unexplored. In this paper, we present the first comprehensive security analysis of mobile LLM agents, encompassing three representative categories: System-level AI Agents developed by original equipment manufacturers (e.g., YOYO Assistant), Third-party Universal Agents (e.g., Zhipu AI AutoGLM), and Emerging Agent Frameworks (e.g., Alibaba Mobile Agent). We begin by analyzing the general workflow of mobile agents and identifying security threats across three core capability dimensions: language-based reasoning, GUI-based interaction, and system-level execution. Our analysis reveals 11 distinct attack surfaces, all rooted in the unique capabilities and interaction patterns of mobile LLM agents, and spanning their entire operational lifecycle. To investigate these threats in practice, we introduce AgentScan, a semi-automated security analysis framework that systematically evaluates mobile LLM agents across all 11 attack scenarios. Applying AgentScan to nine widely deployed agents, we uncover a concerning trend: every agent is vulnerable to targeted attacks. In the most severe cases, agents exhibit vulnerabilities across eight distinct attack vectors. These attacks can cause behavioral deviations, privacy leakage, or even full execution hijacking. Based on these findings, we propose a set of defensive design principles and practical recommendations for building secure mobile LLM agents. Our disclosures have received positive feedback from two major device vendors. Overall, this work highlights the urgent need for standardized security practices in the fast-evolving landscape of LLM-driven mobile automation.

摘要

随着大语言模型(LLMs)的广泛应用，移动计算领域出现了一种新范式——基于LLM的移动AI代理，这种代理能够直接在智能手机上分解并自动化执行复杂任务。然而，这些代理的安全影响尚未得到充分研究。本文首次对移动LLM代理进行了全面安全分析，涵盖三大典型类别：原始设备制造商开发的系统级AI代理(如YOYO助手)、第三方通用代理(如智谱AutoGLM)以及新兴代理框架(如阿里巴巴移动代理)。我们首先分析了移动代理的通用工作流程，并从三个核心能力维度识别安全威胁：基于语言的推理、基于图形界面的交互以及系统级执行。研究发现存在11个不同的攻击面，均源于移动LLM代理的独特能力和交互模式，并贯穿其整个运行生命周期。为实际验证这些威胁，我们开发了AgentScan半自动化安全分析框架，可系统评估移动LLM代理在所有11种攻击场景下的安全性。通过对9个广泛部署的代理进行测试，发现一个严峻现象：每个代理都存在针对性攻击漏洞。最严重情况下，单个代理存在八种不同攻击向量的漏洞，可能导致行为偏差、隐私泄露甚至完全执行劫持。基于这些发现，我们提出了一套防御性设计原则和构建安全移动LLM代理的实用建议。相关披露已获得两家主要设备厂商的积极反馈。总体而言，本研究揭示了在快速发展的LLM驱动移动自动化领域建立标准化安全实践的迫切需求。

ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning

Abstract

arXiv:2505.12996v1 Announce Type: cross Abstract: In recent years, large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, have shown impressive capabilities in complex problems, e.g., mathematics and coding. Some pioneering studies attempt to bring the success of LRMs to neural machine translation (MT). They try to build LRMs with deep reasoning MT ability via reinforcement learning (RL). Although some progress has been made, these attempts generally focus on several high-resource languages, e.g., English and Chinese, leaving the performance on other languages unclear. Besides, the reward modeling methods in previous work do not fully unleash the potential of reinforcement learning in MT. In this work, we first design a new reward modeling method that compares the translation results of the policy MT model with a strong LRM (i.e., DeepSeek-R1-671B), and quantifies the comparisons to provide rewards. Experimental results demonstrate the superiority of the reward modeling method. Using Qwen2.5-7B-Instruct as the backbone, the trained model achieves new state-of-the-art performance in literary translation, and outperforms strong LRMs including OpenAI-o1 and DeepSeek-R1. Furthermore, we extend our method to multilingual settings with 11 languages. With a carefully designed lightweight reward modeling in RL, we can simply transfer the strong MT ability from a single direction into multiple (i.e., 90) translation directions and achieve impressive multilingual MT performance.

摘要

近年来，大型推理模型（LRMs）如OpenAI-o1和DeepSeek-R1的出现，在数学与编程等复杂任务中展现出卓越能力。部分前沿研究尝试将LRMs的成功经验引入神经机器翻译（MT）领域，通过强化学习（RL）构建具备深度推理能力的机器翻译模型。尽管已取得一定进展，但这些尝试通常仅针对英语、汉语等高资源语言，其在不同语种上的性能表现尚不明确。此外，既有研究中的奖励建模方法未能充分释放强化学习在机器翻译中的潜力。本研究首先设计了一种新型奖励建模方法：通过对比策略模型与强效LRM（DeepSeek-R1-671B）的译文结果，将质量差异量化为奖励信号。实验结果表明该奖励建模方法具有显著优势。以Qwen2.5-7B-Instruct为基底的训练模型在文学翻译任务中达到最新最优水平，其性能超越OpenAI-o1与DeepSeek-R1等强效LRMs。进一步地，我们将该方法扩展至11种语言的多语种场景。通过精心设计的轻量化RL奖励建模，成功将单一方向的强翻译能力迁移至90个翻译方向，实现了卓越的多语言机器翻译性能。

Advancing Sequential Numerical Prediction in Autoregressive Models

Abstract

arXiv:2505.13077v1 Announce Type: cross Abstract: Autoregressive models have become the de facto choice for sequence generation tasks, but standard approaches treat digits as independent tokens and apply cross-entropy loss, overlooking the coherent structure of numerical sequences. This paper introduces Numerical Token Integrity Loss (NTIL) to address this gap. NTIL operates at two levels: (1) token-level, where it extends the Earth Mover's Distance (EMD) to preserve ordinal relationships between numerical values, and (2) sequence-level, where it penalizes the overall discrepancy between the predicted and actual sequences. This dual approach improves numerical prediction and integrates effectively with LLMs/MLLMs. Extensive experiments show significant performance improvements with NTIL.

摘要

自回归模型已成为序列生成任务的事实标准，但传统方法将数字视为独立标记并应用交叉熵损失，忽略了数值序列的内在连贯结构。本文提出数值标记完整性损失（NTIL）来解决这一缺陷。NTIL在两个层面发挥作用：（1）标记层面，通过扩展地球移动距离（EMD）来保持数值间的序数关系；（2）序列层面，对预测序列与真实序列的整体差异进行惩罚。这种双重机制显著提升了数值预测性能，并能与LLMs/MLLMs有效集成。大量实验表明，NTIL带来了显著的性能提升。
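
The token-level idea, that a loss should know digit 8 is closer to 7 than digit 1 is, can be sketched with the one-dimensional Earth Mover's Distance. This is a minimal illustration, not the NTIL loss itself, whose exact form the abstract does not give. It relies on the standard fact that for distributions over ordered, unit-spaced classes the EMD equals the sum of absolute differences between their cumulative distributions.

```python
def emd_1d(p: list[float], q: list[float]) -> float:
    """Earth Mover's Distance between two distributions over ordered classes
    with unit spacing: the sum of absolute differences between their CDFs."""
    assert len(p) == len(q)
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi      # running difference of the two CDFs
        total += abs(cdf_gap)
    return total


def one_hot(digit: int, size: int = 10) -> list[float]:
    """Distribution that puts all probability mass on a single digit."""
    return [1.0 if d == digit else 0.0 for d in range(size)]


target = one_hot(7)
near_miss = emd_1d(one_hot(8), target)  # predicted 8, truth 7
far_miss = emd_1d(one_hot(1), target)   # predicted 1, truth 7

print(near_miss, far_miss)
```

Cross-entropy would penalize both wrong predictions identically, since each assigns zero probability to the true digit, whereas the EMD grows with the numerical distance between the predicted and true digits.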

KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025

Abstract

arXiv:2505.13036v1 Announce Type: cross Abstract: The scope of the International Workshop on Spoken Language Translation (IWSLT) has recently broadened beyond traditional Speech Translation (ST) to encompass a wider array of tasks, including Speech Question Answering and Summarization. This shift is partly driven by the growing capabilities of modern systems, particularly with the success of Large Language Models (LLMs). In this paper, we present the Karlsruhe Institute of Technology's submissions for the Offline ST and Instruction Following (IF) tracks, where we leverage LLMs to enhance performance across all tasks. For the Offline ST track, we propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused using an LLM with document-level context. This is followed by a two-step translation process, incorporating an additional refinement step to improve translation quality. For the IF track, we develop an end-to-end model that integrates a speech encoder with an LLM to perform a wide range of instruction-following tasks. We complement it with a final document-level refinement stage to further enhance output quality by using contextual information.

摘要

国际口语翻译研讨会（IWSLT）的研究范围近期已从传统语音翻译（ST）扩展到更广泛的任务领域，包括语音问答与摘要生成。这一转变部分源于现代系统（尤其是大语言模型（LLM）的成功应用）不断增强的能力。本文介绍了卡尔斯鲁厄理工学院在离线语音翻译和指令跟随（IF）赛道的参赛方案，我们通过LLM提升所有任务的性能表现。针对离线语音翻译赛道，我们提出了一种采用多自动语音识别系统的处理流程，其输出通过具备文档级上下文理解的LLM进行融合，随后执行包含额外优化步骤的两阶段翻译流程以提升译文质量。对于指令跟随赛道，我们开发了将语音编码器与LLM相结合的端到端模型，可执行多样化指令跟随任务，并通过最终文档级优化阶段利用上下文信息进一步提升输出质量。

Structure-Aware Corpus Construction and User-Perception-Aligned Metrics for Large-Language-Model Code Completion

Abstract

arXiv:2505.13073v1 Announce Type: cross Abstract: Code completion technology based on large language model has significantly improved the development efficiency of programmers. However, in practical applications, there remains a gap between current commonly used code completion evaluation metrics and users' actual perception. To address this issue, we propose two evaluation metrics for code completion tasks--LCP and ROUGE-LCP, from the perspective of probabilistic modeling. Furthermore, to tackle the lack of effective structural semantic modeling and cross-module dependency information in LLMs for repository-level code completion scenarios, we propose a data processing method based on a Structure-Preserving and Semantically-Reordered Code Graph (SPSR-Graph). Through theoretical analysis and experimental validation, we demonstrate the superiority of the proposed evaluation metrics in terms of user perception consistency, as well as the effectiveness of the data processing method in enhancing model performance.

摘要

基于大语言模型的代码补全技术显著提升了程序员的开发效率。然而在实际应用中，当前常用的代码补全评估指标与用户实际感知之间仍存在差距。针对这一问题，我们从概率建模角度提出了两种代码补全任务评估指标——LCP和ROUGE-LCP。此外，为解决大语言模型在仓库级代码补全场景中缺乏有效的结构语义建模和跨模块依赖信息的问题，我们提出了一种基于结构保持与语义重排序代码图（SPSR-Graph）的数据处理方法。通过理论分析和实验验证，我们证明了所提评估指标在用户感知一致性方面的优越性，以及数据处理方法在提升模型性能方面的有效性。

Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs

Abstract

arXiv:2505.13026v1 Announce Type: cross Abstract: Large language models (LLMs) excel at mathematical reasoning and logical problem-solving. The current popular training paradigms primarily use supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance the models' reasoning abilities. However, when using SFT or RL alone, there are respective challenges: SFT may suffer from overfitting, while RL is prone to mode collapse. The state-of-the-art methods have proposed hybrid training schemes. However, static switching faces challenges such as poor generalization across different tasks and high dependence on data quality. In response to these challenges, inspired by the curriculum learning-quiz mechanism in human reasoning cultivation, we propose SASR, a step-wise adaptive hybrid training framework that theoretically unifies SFT and RL and dynamically balances the two throughout optimization. SASR uses SFT for initial warm-up to establish basic reasoning skills, and then uses an adaptive dynamic adjustment algorithm based on gradient norm and divergence relative to the original distribution to seamlessly integrate SFT with the online RL method GRPO. By monitoring the training status of LLMs and adjusting the training process in sequence, SASR ensures a smooth transition between training schemes, maintaining core reasoning abilities while exploring different paths. Experimental results demonstrate that SASR outperforms SFT, RL, and static hybrid training methods.

摘要

大语言模型（LLMs）在数学推理和逻辑问题解决方面表现出色。当前主流的训练范式主要采用监督微调（SFT）和强化学习（RL）来提升模型的推理能力。然而，单独使用SFT或RL时存在各自的挑战：SFT可能面临过拟合问题，而RL容易出现模式崩溃。最先进的方法提出了混合训练方案，但静态切换面临泛化能力差、对数据质量依赖性强等挑战。针对这些问题，受人类推理培养中课程学习-测验机制的启发，我们提出SASR——一种分阶段自适应的混合训练框架，该框架在理论上统一了SFT与RL，并在整个优化过程中动态平衡两者。SASR首先使用SFT进行初始预热以建立基础推理能力，随后基于梯度范数及与原始分布散度的自适应动态调整算法，将SFT与在线RL方法GRPO无缝集成。通过监控LLMs的训练状态并依次调整训练过程，SASR实现了训练方案间的平滑过渡，在保持核心推理能力的同时探索不同路径。实验结果表明，SASR在性能上优于SFT、RL及静态混合训练方法。

Evaluating the Efficacy of LLM Safety Solutions: The Palit Benchmark Dataset

Abstract

arXiv:2505.13028v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly integrated into critical systems in industries like healthcare and finance. Users can often submit queries to LLM-enabled chatbots, some of which can enrich responses with information retrieved from internal databases storing sensitive data. This gives rise to a range of attacks in which a user submits a malicious query and the LLM system outputs a response that harms the owner, such as leaking internal data or creating legal liability by harming a third party. While security tools are being developed to counter these threats, there is little formal evaluation of their effectiveness and usability. This study addresses this gap by conducting a thorough comparative analysis of LLM security tools. We identified 13 solutions (9 closed-source, 4 open-source), but only 7 were evaluated due to a lack of participation by proprietary model owners. To evaluate them, we built a benchmark dataset of malicious prompts and evaluated these tools' performance against a baseline LLM (ChatGPT-3.5-Turbo). Our results show that the baseline model has too many false positives to be used for this task. Lakera Guard and ProtectAI LLM Guard emerged as the best overall tools, showcasing the tradeoff between usability and performance. The study concludes with recommendations for greater transparency among closed-source providers, improved context-aware detections, enhanced open-source engagement, increased user awareness, and the adoption of more representative performance metrics.

摘要

大型语言模型（LLMs）正日益融入医疗和金融等关键行业的核心系统。用户通常可以向支持LLM的聊天机器人提交查询，其中部分系统能够通过检索存储敏感数据的内部数据库来丰富响应内容。这引发了一系列攻击行为：用户提交恶意查询后，LLM系统输出的响应可能对所有者造成损害，例如泄露内部数据或通过损害第三方利益引发法律责任。尽管当前正在开发安全工具以应对这些威胁，但对其有效性和可用性的正式评估仍十分匮乏。本研究通过开展LLM安全工具的全面对比分析来填补这一空白。我们识别了13种解决方案（9种闭源、4种开源），但由于专有模型所有者缺乏参与，最终仅评估了7种工具。为进行评估，我们构建了恶意提示的基准数据集，并以基线LLM模型（ChatGPT-3.5-Turbo）为参照评估这些工具的表现。结果显示基线模型因误报率过高而不适用于此任务。Lakera Guard和ProtectAI LLM Guard在可用性与性能的权衡中展现出最佳综合表现。研究最终提出建议：闭源供应商应提高透明度、改进上下文感知检测能力、加强开源社区参与、提升用户安全意识，并采用更具代表性的性能评估指标。

The Hidden Dangers of Browsing AI Agents

Abstract

arXiv:2505.13076v1 Announce Type: cross Abstract: Autonomous browsing agents powered by large language models (LLMs) are increasingly used to automate web-based tasks. However, their reliance on dynamic content, tool execution, and user-provided data exposes them to a broad attack surface. This paper presents a comprehensive security evaluation of such agents, focusing on systemic vulnerabilities across multiple architectural layers. Our work outlines the first end-to-end threat model for browsing agents and provides actionable guidance for securing their deployment in real-world environments. To address discovered threats, we propose a defense in depth strategy incorporating input sanitization, planner executor isolation, formal analyzers, and session safeguards. These measures protect against both initial access and post exploitation attack vectors. Through a white box analysis of a popular open source project, Browser Use, we demonstrate how untrusted web content can hijack agent behavior and lead to critical security breaches. Our findings include prompt injection, domain validation bypass, and credential exfiltration, evidenced by a disclosed CVE and a working proof of concept exploit.

摘要

由大型语言模型（LLM）驱动的自主浏览代理正日益用于自动化基于网络的任务。然而，其对动态内容、工具执行和用户提供数据的依赖使其面临广泛的攻击面。本文对此类代理进行了全面的安全评估，重点关注跨多个架构层的系统性漏洞。我们的工作首次提出了浏览代理的端到端威胁模型，并为实际环境中安全部署提供了可操作的指导。针对发现的威胁，我们提出了一种深度防御策略，包括输入净化、规划器执行器隔离、形式化分析器和会话保护机制。这些措施可防范初始访问和利用后攻击向量。通过对热门开源项目Browser Use的白盒分析，我们展示了不可信网络内容如何劫持代理行为并导致严重安全漏洞。研究发现包括提示注入、域验证绕过和凭据外泄，相关证据包括已披露的CVE编号和一个可运行的概念验证漏洞利用程序。

MultiActor-Audiobook: Zero-Shot Audiobook Generation with Faces and Voices of Multiple Speakers

Abstract

arXiv:2505.13082v1 Announce Type: cross Abstract: We introduce MultiActor-Audiobook, a zero-shot approach for generating audiobooks that automatically produces consistent, expressive, and speaker-appropriate prosody, including intonation and emotion. Previous audiobook systems have several limitations: they require users to manually configure the speaker's prosody, read each sentence with a monotonic tone compared to voice actors, or rely on costly training. However, our MultiActor-Audiobook addresses these issues by introducing two novel processes: (1) MSP (Multimodal Speaker Persona Generation) and (2) LSI (LLM-based Script Instruction Generation). With these two processes, MultiActor-Audiobook can generate more emotionally expressive audiobooks with a consistent speaker prosody without additional training. We compare our system with commercial products, through human and MLLM evaluations, achieving competitive results. Furthermore, we demonstrate the effectiveness of MSP and LSI through ablation studies.

摘要

我们提出MultiActor-Audiobook，这是一种零样本生成有声书的方法，能自动产生连贯、富有表现力且符合说话者特征的韵律（包括语调与情感）。现有有声书系统存在若干局限：需要用户手动配置说话者韵律、与专业配音演员相比只能以单调语调朗读句子，或依赖成本高昂的训练。而我们的MultiActor-Audiobook通过引入两项创新流程解决了这些问题：(1) MSP（多模态说话者角色生成）与(2) LSI（基于大语言模型的脚本指令生成）。借助这两个流程，本系统无需额外训练即可生成具有情感表现力且保持说话者韵律一致的有声书。通过人类评估与多模态大模型评估，我们的系统与商业产品相比展现出竞争优势。此外，消融实验验证了MSP和LSI的有效性。

Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning

Abstract

arXiv:2505.13115v1 Announce Type: cross Abstract: The popular success of text-based large language models (LLMs) has focused the attention of the multimodal community on combining other modalities like vision and audio with text to achieve similar multimodal capabilities. In this quest, large audio language models (LALMs) have to be evaluated on reasoning-related tasks, which are different from traditional classification or generation tasks. Towards this goal, we propose a novel dataset called temporal reasoning evaluation of audio (TREA). We benchmark open-source LALMs and observe that they are consistently behind human capabilities on the tasks in the TREA dataset. While evaluating LALMs, we also propose an uncertainty metric, which computes the invariance of the model to semantically identical perturbations of the input. Our analysis shows that the accuracy and uncertainty metrics are not necessarily correlated, which points to a need for holistic evaluation of LALMs for high-stakes applications.

摘要

基于文本的大型语言模型（LLM）的广泛成功，促使多模态研究界将注意力转向结合视觉、音频等其他模态与文本，以实现类似的多模态能力。在这一探索中，大型音频语言模型（LALM）需要在与传统分类或生成任务不同的推理相关任务上进行评估。为此，我们提出了一种名为音频时序推理评估（TREA）的新数据集。我们对开源LALM进行了基准测试，发现它们在TREA数据集任务上的表现始终落后于人类能力。在评估LALM时，我们还提出了一种不确定性度量，用于计算模型对输入语义相同扰动的不变性。我们的分析表明，准确性和不确定性度量并不必然相关，因此指出需要对高风险应用中的LALM进行全面评估。
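
One simple way to picture an invariance-based uncertainty score is to ask the same question in several semantically identical wordings and measure how often the answers disagree. The function below is a hypothetical sketch along those lines, not the metric defined in the paper, and the answer lists are made up for illustration.

```python
from collections import Counter


def disagreement_rate(answers: list[str]) -> float:
    """Share of answers that differ from the most common answer.
    0.0 means the model was fully invariant to the rewordings."""
    if not answers:
        raise ValueError("need at least one answer")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - most_common_count / len(answers)


# Hypothetical multiple-choice answers to four rewordings of one question.
stable = disagreement_rate(["B", "B", "B", "B"])
unstable = disagreement_rate(["B", "A", "B", "C"])

print(stable, unstable)
```

A score like this is independent of correctness: a model can answer "B" every time and still be wrong, which is consistent with the abstract's point that accuracy and uncertainty need not be correlated.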

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Abstract

arXiv:2505.13109v1 Announce Type: cross Abstract: Large language models (LLMs) have been widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods are proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, an algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13 $\times$ speedup compared to SOTA KV retrieval methods.

摘要

大语言模型（LLMs）已被广泛部署，其上下文窗口快速扩展以支持日益严苛的应用需求。然而，长上下文带来了显著的部署挑战，主要源于键值缓存（KV cache）的大小随上下文长度成比例增长。虽然已有研究提出KV缓存压缩方法来解决这一问题，但KV丢弃方法会导致显著的精度损失，而KV检索方法则存在严重的效率瓶颈。我们提出FreeKV框架，通过算法-系统协同优化在保持精度的同时提升KV检索效率。算法层面，FreeKV采用推测式检索将KV选择与召回过程移出关键路径，并结合细粒度校正确保精度；系统层面，通过CPU与GPU内存间的混合KV布局消除碎片化数据传输，并利用双缓冲流式召回进一步提升效率。实验表明，FreeKV在多种场景和模型下均实现近乎无损的精度，相比最先进的KV检索方法可获得高达13倍的加速。

Role-Playing Evaluation for Large Language Models

Abstract

arXiv:2505.13157v1 Announce Type: cross Abstract: Large Language Models (LLMs) demonstrate a notable capacity for adopting personas and engaging in role-playing. However, evaluating this ability presents significant challenges, as human assessments are resource-intensive and automated evaluations can be biased. To address this, we introduce Role-Playing Eval (RPEval), a novel benchmark designed to assess LLM role-playing capabilities across four key dimensions: emotional understanding, decision-making, moral alignment, and in-character consistency. This article details the construction of RPEval and presents baseline evaluations. Our code and dataset are available at https://github.com/yelboudouri/RPEval

摘要

大型语言模型（LLMs）在角色扮演和人格化适应方面展现出显著能力。然而，评估这种能力存在重大挑战，因为人工评估资源消耗大，而自动化评估可能存在偏差。为此，我们提出角色扮演评估基准（RPEval），该新颖基准旨在从四个关键维度评估LLM的角色扮演能力：情感理解、决策制定、道德对齐和角色一致性。本文详细阐述了RPEval的构建过程，并提供了基线评估结果。我们的代码与数据集已发布于https://github.com/yelboudouri/RPEval。

ModernGBERT: German-only 1B Encoder Model Trained from Scratch

Abstract

arXiv:2505.13136v1 Announce Type: cross Abstract: Despite the prominence of decoder-only language models, encoders remain crucial for resource-constrained applications. We introduce ModernGBERT (134M, 1B), a fully transparent family of German encoder models trained from scratch, incorporating architectural innovations from ModernBERT. To evaluate the practical trade-offs of training encoders from scratch, we also present LLäMmlein2Vec (120M, 1B, 7B), a family of encoders derived from German decoder-only models via LLM2Vec. We benchmark all models on natural language understanding, text embedding, and long-context reasoning tasks, enabling a controlled comparison between dedicated encoders and converted decoders. Our results show that ModernGBERT 1B outperforms prior state-of-the-art German encoders as well as encoders adapted via LLM2Vec, with regard to performance and parameter-efficiency. All models, training data, checkpoints and code are publicly available, advancing the German NLP ecosystem with transparent, high-performance encoder models.

摘要

尽管仅解码器语言模型占据主导地位，编码器在资源受限的应用中仍至关重要。我们推出了ModernGBERT（134M、1B）——一个完全透明的德语编码器模型系列，该系列从头开始训练，并融入了ModernBERT的架构创新。为评估从头训练编码器的实际权衡，我们还提出了LLäMmlein2Vec（120M、1B、7B），这是通过LLM2Vec从德语仅解码器模型衍生出的编码器系列。我们在自然语言理解、文本嵌入和长上下文推理任务上对所有模型进行基准测试，从而实现对专用编码器与转换解码器的受控比较。结果表明，ModernGBERT 1B在性能和参数效率方面均优于先前最先进的德语编码器及通过LLM2Vec适配的编码器。所有模型、训练数据、检查点和代码均已公开，以透明、高性能的编码器模型推动德语NLP生态系统发展。

Tianyi: A Traditional Chinese Medicine all-rounder language model and its Real-World Clinical Practice

Abstract

arXiv:2505.13156v1 Announce Type: cross Abstract: Natural medicines, particularly Traditional Chinese Medicine (TCM), are gaining global recognition for their therapeutic potential in addressing human symptoms and diseases. TCM, with its systematic theories and extensive practical experience, provides abundant resources for healthcare. However, the effective application of TCM requires precise syndrome diagnosis, determination of treatment principles, and prescription formulation, which demand decades of clinical expertise. Despite advancements in TCM-based decision systems, machine learning, and deep learning research, limitations in data and single-objective constraints hinder their practical application. In recent years, large language models (LLMs) have demonstrated potential in complex tasks, but they lack specialization in TCM and face significant challenges, such as model sizes too large to deploy and hallucination. To address these challenges, we introduce Tianyi, a 7.6-billion-parameter LLM of a suitable scale designed specifically for TCM, pre-trained and fine-tuned on diverse TCM corpora, including classical texts, expert treatises, clinical records, and knowledge graphs. Tianyi is designed to assimilate interconnected and systematic TCM knowledge through progressive learning. Additionally, we establish TCMEval, a comprehensive evaluation benchmark, to assess LLMs in TCM examinations, clinical tasks, domain-specific question-answering, and real-world trials. The extensive evaluations demonstrate the significant potential of Tianyi as an AI assistant in TCM clinical practice and research, bridging the gap between TCM knowledge and practical application.

摘要

天然药物，特别是传统中医（TCM），因其在治疗人类症状和疾病方面的潜力而逐渐获得全球认可。中医凭借其系统理论和丰富的实践经验，为医疗保健提供了丰富的资源。然而，中医的有效应用需要精确的证候诊断、治疗原则的确定和处方制定，这些都需要数十年的临床专业知识。尽管基于中医的决策系统、机器学习和深度学习研究取得了进展，但数据的局限性和单目标约束阻碍了其实际应用。近年来，大语言模型（LLM）在复杂任务中展现出潜力，但缺乏中医领域的专业性，并面临重大挑战，如模型规模过大难以部署和幻觉问题。为解决这些挑战，我们推出了76亿参数的大语言模型“天医”，其模型规模适中且专为中医设计，通过在多样化的中医语料（包括经典文献、专家论著、临床记录和知识图谱）上进行预训练和微调。“天医”旨在通过渐进式学习方式吸收相互关联且系统化的中医知识。此外，我们建立了全面的评估基准TCMEval，用于评估大语言模型在中医考试、临床任务、领域特定问答和实际试验中的表现。大量评估表明，“天医”作为中医临床实践和研究的AI助手具有显著潜力，能够弥合中医知识与实际应用之间的鸿沟。

Cross-Cloud Data Privacy Protection: Optimizing Collaborative Mechanisms of AI Systems by Integrating Federated Learning and LLMs

Abstract

arXiv:2505.13292v1 Announce Type: cross Abstract: In the age of cloud computing, data privacy protection has become a major challenge, especially when sharing sensitive data across cloud environments. However, how to optimize collaboration across cloud environments remains an unresolved problem. In this paper, we combine federated learning with large-scale language models to optimize the collaborative mechanism of AI systems. Based on the existing federated learning framework, we introduce a cross-cloud architecture in which federated learning works by aggregating model updates from decentralized nodes without exposing the original data. At the same time, combined with large-scale language models, its powerful context and semantic understanding capabilities are used to improve model training efficiency and decision-making ability. We've further innovated by introducing a secure communication layer to ensure the privacy and integrity of model updates and training data. The model enables continuous model adaptation and fine-tuning across different cloud environments while protecting sensitive data. Experimental results show that the proposed method is significantly better than the traditional federated learning model in terms of accuracy, convergence speed and data privacy protection.

摘要

在云计算时代，数据隐私保护已成为重大挑战，尤其是在跨云环境共享敏感数据时。然而如何优化跨云环境协作仍是一个未解决的问题。本文结合联邦学习与大规模语言模型，优化AI系统的协同机制。基于现有联邦学习框架，我们引入一种跨云架构，通过聚合分散节点的模型更新而不暴露原始数据来实现联邦学习。同时结合大规模语言模型，利用其强大的上下文和语义理解能力提升模型训练效率与决策能力。我们进一步创新性地引入安全通信层，确保模型更新与训练数据的隐私性和完整性。该模型能在保护敏感数据的同时，实现跨不同云环境的持续模型适配与微调。实验结果表明，所提方法在准确性、收敛速度和数据隐私保护方面显著优于传统联邦学习模型。
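The aggregation step described above, in which a coordinator combines model updates from decentralized nodes without seeing their raw data, can be sketched as a sample-weighted average. This is a minimal illustration only: the abstract does not name its aggregation rule, so the FedAvg-style weighting, the function name `federated_average`, and the toy parameter vectors are all assumptions.

```python
from typing import List, Sequence


def federated_average(
    updates: Sequence[Sequence[float]], num_samples: Sequence[int]
) -> List[float]:
    """Sample-weighted average of client parameter vectors (FedAvg-style).

    Hypothetical helper: the paper's actual aggregation rule is not stated.
    """
    if not updates or len(updates) != len(num_samples):
        raise ValueError("need exactly one sample count per client update")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(updates[0])
    averaged = [0.0] * dim
    for params, n in zip(updates, num_samples):
        if len(params) != dim:
            raise ValueError("all updates must have the same length")
        weight = n / total  # clients with more data contribute more
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged


# Two hypothetical cloud nodes: only parameters leave each node, never raw data.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_params)
```

A secure communication layer such as the one the abstract mentions would sit around this call (for example, encrypting each update in transit); that part is not shown here.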

ToolSpectrum: Towards Personalized Tool Utilization for Large Language Models

Abstract

arXiv:2505.13176v1 Announce Type: cross Abstract: While integrating external tools into large language models (LLMs) enhances their ability to access real-time information and domain-specific services, existing approaches focus narrowly on functional tool selection following user instructions, overlooking context-aware personalization in tool selection. This oversight leads to suboptimal user satisfaction and inefficient tool utilization, particularly when overlapping toolsets require nuanced selection based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a benchmark designed to evaluate LLMs' capabilities in personalized tool utilization. Specifically, we formalize two key dimensions of personalization, user profile and environmental factors, and analyze their individual and synergistic impacts on tool utilization. Through extensive experiments on ToolSpectrum, we demonstrate that personalized tool utilization significantly improves user experience across diverse scenarios. However, even state-of-the-art LLMs exhibit limited ability to reason jointly about user profiles and environmental factors, often prioritizing one dimension at the expense of the other. Our findings underscore the necessity of context-aware personalization in tool-augmented LLMs and reveal critical limitations of current models. Our data and code are available at https://github.com/Chengziha0/ToolSpectrum.

摘要

尽管将外部工具集成到大型语言模型（LLMs）中增强了其获取实时信息和领域特定服务的能力，但现有方法仅关注遵循用户指令的功能性工具选择，忽视了工具选择中情境感知的个性化。这一疏忽导致用户满意度欠佳和工具利用效率低下，尤其在工具集重叠需要基于情境因素进行细致选择时更为明显。为弥补这一不足，我们提出了ToolSpectrum基准，旨在评估LLMs在个性化工具利用方面的能力。具体而言，我们形式化了用户画像和环境因素这两个个性化关键维度，并分析了它们对工具利用的单独及协同影响。通过在ToolSpectrum上的大量实验，我们证明个性化工具利用能显著提升多样化场景下的用户体验。然而，即使最先进的LLMs也表现出在联合推理用户画像和环境因素方面的有限能力，往往优先考虑一个维度而牺牲另一个。我们的研究结果强调了工具增强型LLMs中情境感知个性化的必要性，并揭示了当前模型的关键局限性。数据与代码详见https://github.com/Chengziha0/ToolSpectrum。

WikiPersonas: What Can We Learn From Personalized Alignment to Famous People?

Abstract

arXiv:2505.13257v1 Announce Type: cross Abstract: Preference alignment has become a standard pipeline in finetuning models to follow generic human preferences. The majority of work seeks to optimize models to produce responses that would be preferable on average, simplifying the diverse and often contradictory space of human preferences. While research has increasingly focused on personalized alignment, that is, adapting models to individual user preferences, there is a lack of personalized preference datasets that focus on nuanced individual-level preferences. To address this, we introduce WikiPersona: the first fine-grained personalization dataset using well-documented, famous individuals. Our dataset challenges models to align with these personas through an interpretable process: generating verifiable textual descriptions of a persona's background and preferences in addition to alignment. We systematically evaluate different personalization approaches and find that, whereas few-shot prompting with preferences and fine-tuning fail to simultaneously ensure effectiveness and efficiency, using inferred personal preferences as prefixes enables effective personalization, especially in topics where preferences clash, while leading to more equitable generalization across unseen personas.

摘要

偏好对齐已成为微调模型以遵循通用人类偏好的标准流程。大多数研究致力于优化模型，使其生成在平均情况下更受偏好的响应，从而简化了人类偏好多样化且常相互矛盾的特性。尽管研究日益关注个性化对齐（即调整模型以适应个体用户偏好），但目前缺乏专注于细致个体层面偏好的个性化数据集。为此，我们推出WikiPersona：首个基于有据可查的知名人物构建的细粒度个性化数据集。该数据集通过可解释的流程挑战模型与人物角色对齐的能力：除对齐外，还需生成可验证的文本描述来呈现人物背景及其偏好。我们系统评估了不同个性化方法，发现当少量样本提示与微调均无法同时保证效果和效率时，使用推断的个人偏好作为前缀能实现有效个性化，尤其在偏好冲突的主题中表现突出，同时能对未见人物角色实现更公平的泛化。

Abstract

arXiv:2505.13338v1 Announce Type: cross Abstract: Current speech-LLMs exhibit limited capability in contextual reasoning alongside paralinguistic understanding, primarily due to the lack of Question-Answer (QA) datasets that cover both aspects. We propose a novel framework for dataset generation from in-the-wild speech data that integrates contextual reasoning with paralinguistic information. It consists of a pseudo paralinguistic label-based data condensation of in-the-wild speech and LLM-based Contextual Paralinguistic QA (CPQA) generation. The effectiveness is validated by a strong correlation in evaluations of the Qwen2-Audio-7B-Instruct model on a dataset created by our framework and a human-generated CPQA dataset. The results also reveal the speech-LLM's limitations in handling empathetic reasoning tasks, highlighting the need for such datasets and more robust models. The proposed framework is the first of its kind and has potential in training more robust speech-LLMs with paralinguistic reasoning capabilities.

摘要

当前语音大语言模型在语境推理与副语言理解方面表现有限，这主要源于缺乏同时涵盖这两个方面的问答数据集。我们提出了一种从真实场景语音数据生成数据集的新框架，该框架将语境推理与副语言信息相融合。该框架包含基于伪副语言标签的真实语音数据浓缩，以及基于大语言模型的语境副语言问答生成。通过评估Qwen2-Audio-7B-Instruct模型在我们框架创建的数据集与人工构建的CPQA数据集上的强相关性，验证了该框架的有效性。结果同时揭示了语音大语言模型在处理共情推理任务时的局限性，凸显了对此类数据集及更强健模型的需求。所提出的框架属同类首创，在训练具有副语言推理能力的强健语音大语言模型方面具有潜力。

R3: Robust Rubric-Agnostic Reward Models

Abstract

arXiv:2505.13388v1 Announce Type: cross Abstract: Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at https://github.com/rubricreward/r3

摘要

奖励模型对于使语言模型输出与人类偏好保持一致至关重要，但现有方法往往缺乏可控性和可解释性。这些模型通常针对狭窄目标进行优化，限制了其在更广泛下游任务中的泛化能力。此外，其标量输出若缺乏上下文推理则难以解释。为解决这些局限，我们提出R3——一个新型奖励建模框架，该框架不受评分标准限制、可跨评估维度泛化，并能提供可解释的合理化评分分配。R3实现了对语言模型更透明灵活的评估，支持与多样化人类价值观及使用场景的稳健对齐。我们的模型、数据及代码已在https://github.com/rubricreward/r3开源发布。

Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

Abstract

arXiv:2505.13308v1 Announce Type: cross Abstract: Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.

摘要

推理能力作为人类智能的核心组成部分，始终是大型语言模型（LLM）实现通用人工智能的重大挑战。尽管模型性能在训练扩展定律下有所提升，但仍存在显著挑战，特别是在训练算法方面（如灾难性遗忘）以及新颖训练数据的有限可用性。作为替代方案，测试时扩展通过增加测试阶段计算量（无需参数更新）来提升推理性能。不同于该范式下先前聚焦于词元空间的方法，我们提出利用潜在空间以实现更高效的推理效果和更好的测试时扩展定律遵循性。我们创新性地提出LatentSeek框架，通过模型潜在空间内的测试时实例级自适应（TTIA）来增强LLM推理能力。具体而言，LatentSeek借助策略梯度方法，在自生成奖励信号的引导下迭代更新潜在表征。我们在GSM8K、MATH-500和AIME2024等多个推理基准测试上，跨多种LLM架构对LatentSeek进行评估。结果表明，LatentSeek始终优于思维链提示和基于微调的方法等强基线。进一步分析显示，LatentSeek具有高效性——对于平均复杂度问题通常能在数次迭代内收敛，同时还能从额外迭代中获益，这凸显了潜在空间测试时扩展的潜力。这些发现使LatentSeek成为提升LLM推理能力的轻量化、可扩展且高效的解决方案。

RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning

Abstract

arXiv:2505.13307v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks, spurring research into its underlying mechanisms. However, two primary challenges remain for real-world applications: (1) the lack of quantitative metrics and actionable guidelines for evaluating and optimizing measurable boundaries of CoT capability, and (2) the absence of methods to assess boundaries of unmeasurable CoT capability, such as multimodal perception. To address these gaps, we introduce the Reasoning Boundary Framework++ (RBF++). To tackle the first challenge, we define the reasoning boundary (RB) as the maximum limit of CoT performance. We also propose a combination law for RBs, enabling quantitative analysis and offering actionable guidance across various CoT tasks. For the second challenge, particularly in multimodal scenarios, we introduce a constant assumption, which replaces unmeasurable RBs with scenario-specific constants. Additionally, we propose the reasoning boundary division mechanism, which divides unmeasurable RBs into two sub-boundaries, facilitating the quantification and optimization of both unmeasurable domain knowledge and multimodal perception capabilities. Extensive experiments involving 38 models across 13 tasks validate the feasibility of our framework in cross-modal settings. Additionally, we evaluate 10 CoT strategies, offer insights into optimization and decay from two complementary perspectives, and expand evaluation benchmarks for measuring RBs in LLM reasoning. We hope this work advances the understanding of RBs and optimization strategies in LLMs. Code and data are available at https://github.com/LightChen233/reasoning-boundary.

摘要

思维链（CoT）推理已被证明能有效增强大语言模型（LLMs）处理复杂任务的能力，这推动了对其底层机制的研究。然而，实际应用仍面临两大挑战：（1）缺乏量化指标和可操作指南来评估和优化CoT能力的可测量边界；（2）尚无方法能评估不可测量的CoT能力边界，例如多模态感知。为解决这些问题，我们提出推理边界框架++（RBF++）。针对第一个挑战，我们将推理边界（RB）定义为CoT性能的最大极限，并提出RB的组合定律，支持跨CoT任务的定量分析并提供可操作的优化指导。对于第二个挑战（尤其是多模态场景），我们引入常数假设，用场景特定常量替代不可测量的RB。此外，我们提出推理边界划分机制，将不可测量的RB划分为两个子边界，从而实现对不可测量领域知识和多模态感知能力的量化与优化。在13项任务中涉及38个模型的广泛实验验证了该框架在跨模态场景中的可行性。我们还评估了10种CoT策略，从两个互补视角揭示了优化与衰退的规律，并扩展了用于衡量LLM推理中RB的评估基准。希望这项工作能推动对LLMs中RB及优化策略的理解。代码与数据详见https://github.com/LightChen233/reasoning-boundary。

J4R: Learning to Judge with Equivalent Initial State Group Relative Preference Optimization

Abstract

arXiv:2505.13346v1 Announce Type: cross Abstract: To keep up with the accelerating pace of large language model (LLM) development, model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality, but struggle in reasoning-intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.

摘要

为了适应大语言模型（LLM）快速发展的步伐，模型输出评估已从耗时的人工评估转向自动评估，即由LLM自身对其他模型输出进行评判。LLM-as-judge（法官模型）作为一类生成式评估器，在聊天质量等相对简单领域的评估中表现优异，但在涉及复杂推理的领域（模型回应包含更具实质性和挑战性的内容）时则面临困难。针对现有法官模型的不足，我们探索通过强化学习（RL）训练评估模型的方法。本研究作出三项关键贡献：（1）提出等效初始状态组相对策略优化算法（EIS-GRPO），该算法能有效消除复杂评估场景中产生的位置偏差，使法官模型具备鲁棒性；（2）构建ReasoningJudgeBench基准测试，用于评估法官模型在先前研究未涵盖的多样化推理场景中的表现；（3）训练出推理专用法官模型J4R（7B参数），该模型通过EIS-GRPO训练后，在JudgeBench和ReasoningJudgeBench上的表现分别超越GPT-4o和次优小型法官模型6.7%与9%，其性能达到甚至超过经GRPO训练的大型法官模型。
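One plausible reading of "equivalent initial state" is that the same pair of responses, presented to the judge in both orders, is treated as a single group when computing group-relative advantages. The sketch below illustrates that idea only; the abstract does not spell out EIS-GRPO, so the pooling rule, the function name `pooled_group_advantages`, and the toy rewards are assumptions rather than the authors' algorithm.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def pooled_group_advantages(
    rewards_ab: Sequence[float], rewards_ba: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize rewards over rollouts from BOTH presentation orders.

    Hypothetical sketch: rollouts judged with order (A, B) and order (B, A)
    share one baseline, so a judge that is only right in one order is
    penalized relative to the pooled group.
    """
    pooled = list(rewards_ab) + list(rewards_ba)
    if not pooled:
        raise ValueError("need at least one rollout")
    mu = mean(pooled)
    sigma = pstdev(pooled)
    return [(r - mu) / (sigma + eps) for r in pooled]


# Toy case: the judge is correct in order (A, B) but wrong once swapped.
advantages = pooled_group_advantages([1.0, 1.0], [0.0, 0.0])
print(advantages)
```

With a separate baseline per ordering, both groups above would have zero variance and carry no learning signal; pooling is what exposes the positional inconsistency.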

Thinkless: LLM Learns When to Think

Abstract

arXiv:2505.13379v1 Announce Type: cross Abstract: Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, <short> for concise responses and <think> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% - 90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless

摘要

推理语言模型凭借其扩展的思维链推理能力，在需要复杂逻辑推断的任务中展现出卓越性能。然而，对所有查询均采用精细推理常导致显著的计算效率低下，尤其当许多问题存在直接解法时。这引出一个开放性问题：大语言模型能否学会何时进行深度思考？为此，我们提出Thinkless框架，通过强化学习范式使模型能基于任务复杂度与自身能力，自适应选择简略回答或详细推理。该框架采用<short>和<think>两个控制标记分别对应两种推理模式，其核心是解耦分组相对策略优化算法（DeGRPO）：将混合推理的学习目标分解为（1）控制标记损失函数——管理推理模式选择；（2）响应损失函数——提升答案准确性。这种解耦机制能精细调控各目标的贡献度，有效稳定训练过程并防止原始GRPO出现的崩溃现象。实验表明，在Minerva Algebra、MATH-500和GSM8K等基准测试中，Thinkless能将长链推理使用量减少50%-90%，显著提升推理语言模型的效率。代码已开源：https://github.com/VainF/Thinkless
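The decoupling described above can be sketched for a single sampled response: the control token gets its own weighted term, and the response tokens are averaged separately so that a long response does not drown out the mode-selection signal. This is a simplified illustration under stated assumptions: the weight `alpha`, the function name, and the omission of clipping and KL terms are choices made here, not details confirmed by the abstract.

```python
from typing import Sequence


def degrpo_style_loss(
    control_logprob: float,
    response_logprobs: Sequence[float],
    advantage: float,
    alpha: float = 1.0,
) -> float:
    """Policy-gradient loss split into control-token and response parts.

    Hypothetical sketch: plain floats stand in for tensors; real training
    would use a differentiable framework and a clipped objective.
    """
    if not response_logprobs:
        raise ValueError("response must contain at least one token")
    # (1) control-token term: governs the choice between <short> and <think>
    control_term = alpha * advantage * control_logprob
    # (2) response term: length-normalized over the answer tokens only
    response_term = advantage * sum(response_logprobs) / len(response_logprobs)
    return -(control_term + response_term)


loss = degrpo_style_loss(-0.5, [-0.1, -0.2, -0.3], advantage=2.0, alpha=0.5)
print(loss)
```

Setting `alpha` separately from the response normalization is what gives the fine-grained control over each objective's contribution that the abstract refers to.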

Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference

Abstract

arXiv:2505.13345v1 Announce Type: cross Abstract: Mixture-of-experts (MoE) architectures could achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, such communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over 40% of runtime in large-scale training). In this paper, we first define collaborative communication to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, given a pair of experts co-activated by one token, we call them "collaborated", which comprises two cases, intra- and inter-collaboration, depending on whether they are kept on the same device. Our pilot investigations reveal that augmenting the proportion of intra-collaboration can accelerate expert parallelism at scale. This motivates us to strategically optimize collaborative communication for accelerated MoE training and inference, an approach we dub Occult. Our designs are capable of either delivering exact results with reduced communication cost or controllably minimizing the cost with collaboration pruning, materialized by modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that Occult can be faster than popular state-of-the-art inference or training frameworks (more than 1.5× speedup across multiple tasks and models) with comparable or superior quality compared to standard fine-tuning. Code is available at https://github.com/UNITES-Lab/Occult.

摘要

混合专家（MoE）架构通过专家并行化能够实现显著的计算效率，但其高度依赖设备间的全连接通信。遗憾的是，这类通信开销通常占总运行时的很大比重，制约了现代MoE模型分布式训练与推理的扩展性（在大规模训练中消耗超过40%的运行时间）。本文首先定义"协作通信"以阐明这一固有局限，继而提出系统层与算法层的创新方案来降低通信成本。具体而言，当某标记同时激活一对专家时，我们称其为"协作专家"，并根据其是否驻留于同一设备区分为"内部协作"与"跨设备协作"两种情形。初步研究表明，提升内部协作比例可有效增强专家并行化的扩展效率。基于此发现，我们提出战略性地优化协作通信以加速MoE训练与推理的系统Occult。该方案通过两种途径实现：在保证结果精确性的前提下降低通信开销，或采用协作剪枝技术可控地最小化通信成本（通过改进的微调实现）。在多种MoE-LLM上的综合实验表明，Occult相较当前主流推理/训练框架可获得更快的执行速度（在多项任务与模型上实现1.5倍以上加速），且其微调质量与标准方法相当或更优。代码已发布于https://github.com/UNITES-Lab/Occult。
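The intra- versus inter-collaboration distinction can be made concrete by counting co-activated expert pairs under a given expert-to-device placement. The snippet below is an illustrative sketch only; the routing table, the placement map, and the function name `collaboration_counts` are hypothetical and are not taken from the Occult codebase.

```python
from itertools import combinations
from typing import Dict, Iterable, Sequence, Tuple


def collaboration_counts(
    routing: Iterable[Sequence[int]], placement: Dict[int, int]
) -> Tuple[int, int]:
    """Count (intra, inter) collaborations over all tokens.

    routing:   for each token, the expert ids it activates (e.g. top-2).
    placement: expert id -> device id.
    """
    intra = inter = 0
    for experts in routing:
        # every unordered pair of experts activated by the same token
        for a, b in combinations(sorted(set(experts)), 2):
            if placement[a] == placement[b]:
                intra += 1  # both experts live on the same device
            else:
                inter += 1  # the pair spans two devices
    return intra, inter


# Four toy tokens with top-2 routing; experts 0,1 on device 0 and 2,3 on device 1.
routing = [(0, 1), (0, 2), (1, 3), (0, 1)]
placement = {0: 0, 1: 0, 2: 1, 3: 1}
intra, inter = collaboration_counts(routing, placement)
print(intra, inter)
```

Raising the intra share, for instance by re-placing experts that are frequently co-activated onto the same device, is the quantity the abstract reports as accelerating expert parallelism.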

Optimizing Anytime Reasoning via Budget Relative Policy Optimization

Abstract

arXiv:2505.13438v1 Announce Type: cross Abstract: Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within sampled token budgets from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.

摘要

扩展测试时计算对于提升大语言模型（LLM）的推理能力至关重要。现有方法通常采用强化学习（RL）来最大化推理轨迹末端可验证的奖励。然而，此类方法仅针对固定大令牌预算下的最终性能进行优化，导致训练和部署效率低下。本研究提出新型框架AnytimeReasoner，以优化任意时刻的推理性能，旨在提升令牌效率及不同令牌预算约束下的推理灵活性。为实现这一目标，我们将完整思维过程截断以适应从先验分布中采样的令牌预算，迫使模型为每个截断思维生成最优答案摘要以供验证。该方法将可验证的密集奖励引入推理过程，从而优化RL中的信用分配效率。随后，我们通过解耦方式分别优化思维策略和摘要策略以最大化累积奖励。此外，我们提出新型方差缩减技术——预算相对策略优化（BRPO），以增强思维策略强化过程中学习过程的鲁棒性和效率。数学推理任务的实验结果表明，在不同先验分布下，我们的方法在所有思维预算条件下均持续优于GRPO，同时提升了训练效率和令牌使用效率。
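The truncate-then-summarize procedure described above turns one reasoning trace into several verifiable rewards, one per token budget. Below is a toy sketch of that idea; the `summarize` and `verify` callables, the fixed budget list, and the token sequence are stand-ins introduced here for illustration, whereas the paper samples budgets from a prior distribution and uses a learned summary policy.

```python
from typing import Callable, List, Sequence


def anytime_rewards(
    thinking_tokens: Sequence[str],
    budgets: Sequence[int],
    summarize: Callable[[Sequence[str]], str],
    verify: Callable[[str], bool],
) -> List[float]:
    """Return one 0/1 reward per budget by truncating the thinking trace."""
    rewards = []
    for budget in budgets:
        prefix = thinking_tokens[:budget]  # thinking cut off at this budget
        answer = summarize(prefix)         # best answer given the prefix
        rewards.append(1.0 if verify(answer) else 0.0)
    return rewards


# Toy trace: the correct answer "4" only appears once all five tokens are seen.
trace = ["2", "+", "2", "=", "4"]
rewards = anytime_rewards(
    trace,
    budgets=[2, 4, 5],  # hypothetical budgets; sampled from a prior in the paper
    summarize=lambda prefix: prefix[-1] if prefix else "",
    verify=lambda answer: answer == "4",
)
print(rewards)
```

Because every budget yields its own reward, credit can be assigned to earlier parts of the trace instead of only to the final answer.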

AdaptThink: Reasoning Models Can Learn When to Think

Abstract

arXiv:2505.13417v1 Announce Type: cross Abstract: Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.

摘要

近期，大型推理模型通过采用类人深度思考机制，在各种任务中展现出卓越性能。然而冗长的思维过程显著增加了推理开销，使效率成为关键瓶颈。本研究首先证明，对于相对简单的任务，直接提示推理模型跳过思考环节并输出最终解决方案的"无思考"模式，在性能和效率上均为更优选择。受此启发，我们提出AdaptThink算法，该强化学习方法能指导推理模型根据问题难度自适应选择最优思考模式。具体而言，AdaptThink包含两个核心组件：(1) 约束优化目标函数，在保持整体性能的前提下鼓励模型选择无思考模式；(2) 重要性采样策略，在策略训练过程中平衡思考与无思考样本，实现冷启动并确保模型在训练全程能探索利用两种模式。实验表明，AdaptThink在显著降低推理成本的同时进一步提升了性能。值得注意的是，在三个数学数据集上，该方法将DeepSeek-R1-Distill-Qwen-1.5B的平均响应长度缩短53%，准确率提升2.4%，凸显了自适应思考模式选择在优化推理质量与效率平衡方面的潜力。代码与模型已开源：https://github.com/THU-KEG/AdaptThink。
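The constrained objective in component (1) can be illustrated in penalty form: a response earns a small bonus for skipping thinking, plus the difference between its reward and a reference model's average reward on the same problem. This is a sketch under assumptions: the bonus size `delta`, the function name, and the exact penalty form are chosen here for illustration and are not quoted from the paper.

```python
def adaptthink_style_advantage(
    skipped_thinking: bool,
    reward: float,
    reference_reward: float,
    delta: float = 0.05,
) -> float:
    """Per-sample advantage: NoThinking bonus plus reward gap to a reference.

    Hypothetical sketch: `reference_reward` is assumed to be the mean accuracy
    of a frozen reference model on the same problem.
    """
    bonus = delta if skipped_thinking else 0.0
    return bonus + (reward - reference_reward)


# On a problem the reference solves 80% of the time:
a_nothink_ok = adaptthink_style_advantage(True, 1.0, 0.8)    # correct, no thinking
a_think_ok = adaptthink_style_advantage(False, 1.0, 0.8)     # correct, with thinking
a_nothink_bad = adaptthink_style_advantage(True, 0.0, 0.8)   # wrong, no thinking
print(a_nothink_ok, a_think_ok, a_nothink_bad)
```

The bonus favors NoThinking only when accuracy holds up; a wrong answer without thinking still scores far below a correct answer with thinking.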

CIE: Controlling Language Model Text Generations Using Continuous Signals

Abstract

arXiv:2505.13448v1 Announce Type: cross Abstract: Aligning language models with user intent is becoming increasingly relevant to enhance user experience. This calls for designing methods that can allow users to control the properties of the language that LMs generate, for example, the length of the generation, the complexity of the language that gets chosen, the sentiment, tone, etc. Most existing work attempts to integrate users' control by conditioning LM generations on natural language prompts or discrete control signals, which are often brittle and hard to scale. In this work, we are interested in continuous control signals, ones that exist along a spectrum that can't easily be captured in a natural language prompt or via existing techniques in conditional generation. Through a case study in controlling the precise response-length of generations produced by LMs, we demonstrate how, after fine-tuning, behaviors of language models can be controlled via continuous signals, represented as vectors that are interpolated between a "low" and a "high" token embedding. Our method more reliably exerts response-length control than in-context learning methods or fine-tuning methods that represent the control signal as a discrete signal. Our full open-sourced code and datasets are available at https://github.com/vsamuel2003/CIE.

摘要

将语言模型与用户意图对齐对于提升用户体验正变得愈发重要。这要求设计能够允许用户控制语言模型生成文本属性的方法，例如控制生成内容的长度、所选语言的复杂度、情感倾向及语气等。现有研究大多尝试通过自然语言提示或离散控制信号来调节语言模型生成，但这些方法往往脆弱且难以扩展。本研究聚焦于连续控制信号——这类信号存在于一个连续谱系中，难以通过自然语言提示或现有条件生成技术捕捉。通过控制语言模型生成响应长度的案例研究，我们证明经过微调的模型可通过连续信号（即介于"低"与"高"词嵌入之间的插值向量）实现行为控制。相较于上下文学习方法或将控制信号表示为离散信号的微调方法，我们的方法能更可靠地实现响应长度控制。完整开源代码及数据集详见https://github.com/vsamuel2003/CIE。
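The continuous control signal described above, a vector interpolated between a "low" and a "high" token embedding, can be sketched as plain linear interpolation. The snippet is illustrative only: the three-dimensional toy embeddings, the 0 to 1000 length range, and the function name `interpolate_control` are assumptions, and linear interpolation is the simplest reading of "interpolated" rather than a confirmed detail of the method.

```python
from typing import List, Sequence


def interpolate_control(
    low: Sequence[float],
    high: Sequence[float],
    value: float,
    lo: float,
    hi: float,
) -> List[float]:
    """Map a scalar control value in [lo, hi] to a vector between two embeddings."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    if len(low) != len(high):
        raise ValueError("embeddings must have the same dimension")
    t = (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp out-of-range requests to the endpoints
    return [(1.0 - t) * a + t * b for a, b in zip(low, high)]


# Hypothetical "low" and "high" token embeddings and a target length of 250.
low_emb = [0.0, 1.0, -2.0]
high_emb = [4.0, 1.0, 2.0]
signal = interpolate_control(low_emb, high_emb, value=250, lo=0, hi=1000)
print(signal)
```

The resulting vector would be fed to the fine-tuned model in place of a discrete length token, so nearby target lengths map to nearby inputs.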

Learnware of Language Models: Specialized Small Language Models Can Do Big

Abstract

arXiv:2505.13425v1 Announce Type: cross Abstract: The learnware paradigm offers a novel approach to machine learning by enabling users to reuse a set of well-trained models for tasks beyond the models' original purposes. It eliminates the need to build models from scratch, instead relying on specifications (representations of a model's capabilities) to identify and leverage the most suitable models for new tasks. While learnware has proven effective in many scenarios, its application to language models has remained largely unexplored. At the same time, large language models (LLMs) have demonstrated remarkable universal question-answering abilities, yet they face challenges in specialized scenarios due to data scarcity, privacy concerns, and high computational costs, thus more and more specialized small language models (SLMs) are being trained for specific domains. To address these limitations systematically, the learnware paradigm provides a promising solution by enabling maximum utilization of specialized SLMs, and allowing users to identify and reuse them in a collaborative and privacy-preserving manner. This paper presents a preliminary attempt to apply the learnware paradigm to language models. We simulated a learnware system comprising approximately 100 learnwares of specialized SLMs with 8B parameters, fine-tuned across finance, healthcare, and mathematics domains. Each learnware contains an SLM and a specification, which enables users to identify the most relevant models without exposing their own data. Experimental results demonstrate promising performance: by selecting one suitable learnware for each task-specific inference, the system outperforms the base SLMs on all benchmarks. Compared to LLMs, the system outperforms Qwen1.5-110B, Qwen2.5-72B, and Llama3.1-70B-Instruct by at least 14% in finance domain tasks, and surpasses Flan-PaLM-540B (ranked 7th on the Open Medical LLM Leaderboard) in medical domain tasks.

摘要

学习件范式为机器学习提供了一种创新方法，它允许用户将一组训练有素的模型重用于超出其原始设计目的的任务。该范式无需从零开始构建模型，而是通过规范（描述模型能力的表征）来识别并利用最适合新任务的模型。尽管学习件已在诸多场景中验证其有效性，但其在语言模型领域的应用仍鲜有探索。与此同时，大型语言模型（LLMs）虽展现出卓越的通用问答能力，却在专业场景下面临数据稀缺、隐私顾虑和高计算成本等挑战，因此越来越多针对特定领域的专业化小型语言模型（SLMs）被训练出来。为系统性地解决这些局限，学习件范式提供了可行方案：既能最大化利用专业化SLMs，又能让用户在保护隐私的协作模式下识别并重用它们。

本文首次尝试将学习件范式应用于语言模型。我们模拟了一个包含约100个学习件的系统，这些学习件由参数规模为80亿的领域专用SLMs构成，涵盖金融、医疗和数学领域。每个学习件包含一个SLM及其规范，使用户能在不暴露自身数据的情况下识别最相关模型。实验结果表明：通过为每个任务推理选择合适的学习件，该系统在所有基准测试中均优于基础SLMs。相较于LLMs，该系统在金融领域任务中至少领先Qwen1.5-110B、Qwen2.5-72B和Llama3.1-70B-Instruct模型14%，在医疗领域任务中超越Open Medical LLM排行榜第七名的Flan-PaLM-540B模型。

Automating construction contract review using knowledge graph-enhanced large language models

Abstract

arXiv:2309.12132v2 Announce Type: replace Abstract: An effective and efficient review of construction contracts is essential for minimizing construction project losses, but current methods are time-consuming and error-prone. Studies using methods based on Natural Language Processing (NLP) exist, but their scope is often limited to text classification or segmented label prediction. This paper investigates whether integrating Large Language Models (LLMs) and Knowledge Graphs (KGs) can enhance the accuracy and interpretability of automated contract risk identification. A tuning-free approach is proposed that integrates LLMs with a Nested Contract Knowledge Graph (NCKG) using a Graph Retrieval-Augmented Generation (GraphRAG) framework for contract knowledge retrieval and reasoning. Tested on international EPC contracts, the method achieves more accurate risk evaluation and more interpretable risk summaries than baseline models. These findings demonstrate the potential of combining LLMs and KGs for reliable reasoning in tasks that are knowledge-intensive and specialized, such as contract review.

摘要

高效精准的施工合同审查对减少工程项目损失至关重要，但现有方法耗时且易出错。现有基于自然语言处理（NLP）的研究多局限于文本分类或片段标签预测。本文探究大型语言模型（LLMs）与知识图谱（KGs）的融合能否提升合同风险自动识别的准确性与可解释性。提出一种免调优方法，通过图检索增强生成（GraphRAG）框架将LLMs与嵌套合同知识图谱（NCKG）相结合，实现合同知识检索与推理。在国际EPC合同上的测试表明，该方法相比基线模型能提供更精确的风险评估与可解释的风险摘要。这些发现证明了LLMs与KGs在合同审查等知识密集型专业任务中实现可靠推理的潜力。

Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak

Abstract

arXiv:2405.20015v2 Announce Type: replace Abstract: This paper focuses on jailbreaking attacks against large language models (LLMs), eliciting them to generate objectionable content in response to harmful user queries. Unlike previous LLM-jailbreak methods that target LLMs directly, our approach begins by constructing a multimodal large language model (MLLM) built upon the target LLM. Subsequently, we perform an efficient MLLM jailbreak and obtain a jailbreaking embedding. Finally, we convert the embedding into a textual jailbreaking suffix to carry out the jailbreak of the target LLM. Compared to the direct LLM-jailbreak methods, our indirect jailbreaking approach is more efficient, as MLLMs are more vulnerable to jailbreak than pure LLMs. Additionally, to improve the attack success rate of jailbreak, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness. Moreover, our approach exhibits superior cross-class generalization abilities.

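Summary

This paper focuses on jailbreaking attacks against large language models (LLMs), which aim to induce a model to generate objectionable content in response to harmful user queries. Unlike earlier jailbreak methods that target LLMs directly, our approach first constructs a multimodal large language model (MLLM) on top of the target LLM, then performs an efficient MLLM jailbreak to obtain a jailbreaking embedding, and finally converts that embedding into a textual jailbreaking suffix used to jailbreak the target LLM. Compared with direct LLM jailbreak methods, this indirect route is more efficient because MLLMs are more vulnerable to jailbreaking than pure LLMs. To raise the attack success rate, we also propose an image-text semantic matching scheme for selecting a suitable initial input. Extensive experiments show that our approach surpasses current state-of-the-art jailbreak methods in both efficiency and effectiveness, and exhibits superior cross-class generalization.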

Reinforcement Learning: An Overview

Abstract

arXiv:2412.05265v3 Announce Type: replace Abstract: This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based methods, policy-based methods, model-based methods, multi-agent RL, LLMs and RL, and various other topics (e.g., offline RL, hierarchical RL, intrinsic reward).

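Summary

This manuscript gives a comprehensive, up-to-date overview of (deep) reinforcement learning and sequential decision making, covering value-based methods, policy-based methods, model-based methods, multi-agent RL, the combination of LLMs and RL, and various other topics (e.g., offline RL, hierarchical RL, intrinsic reward).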

AXIS: Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents

Abstract

arXiv:2409.17140v2 Announce Type: replace Abstract: Multimodal large language models (MLLMs) have enabled LLM-based agents to directly interact with application user interfaces (UIs), enhancing agents' performance in complex tasks. However, these agents often suffer from high latency and low reliability due to the extensive sequential UI interactions. To address this issue, we propose AXIS, a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over UI actions. This framework also facilitates the creation and expansion of APIs through automated exploration of applications. Our experiments on Microsoft Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining accuracy of 97%-98% compared to humans. Our work contributes to a new human-agent-computer interaction (HACI) framework and explores a fresh UI design principle for application providers to turn applications into agents in the era of LLMs, paving the way towards an agent-centric operating system (Agent OS).

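Summary

Multimodal large language models (MLLMs) allow LLM-based agents to interact directly with application user interfaces (UIs), improving agent performance on complex tasks. However, because they require long sequences of UI interactions, these agents often suffer from high latency and low reliability. To address this, we propose AXIS, a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over UI actions. The framework also supports creating and expanding APIs through automated exploration of applications. Our experiments on Microsoft Word show that, compared with human operation, AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53% while maintaining 97%-98% accuracy. This work contributes a new human-agent-computer interaction (HACI) framework and explores a new UI design principle that lets application providers turn applications into agents in the era of LLMs, paving the way toward an agent-centric operating system (Agent OS).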

Mitigating Selection Bias with Node Pruning and Auxiliary Options

Abstract

arXiv:2409.18857v2 Announce Type: replace Abstract: Large language models (LLMs) often exhibit systematic preferences for certain answer choices when responding to multiple-choice questions, a behavior known as selection bias. This bias reduces the accuracy and reliability of LLM outputs, limiting their usefulness in decision-critical applications. While prior work has focused on adjusting model inputs or outputs to mitigate this issue, our work takes a fundamentally different approach by identifying and removing the internal sources of bias. We introduce two methods: Bias Node Pruning (BNP), which prunes parameters that contribute to selection bias, and Auxiliary Option Injection (AOI), which introduces an additional answer choice to reduce bias in both white-box and black-box settings. To address the shortcomings of existing evaluation metrics, we propose Choice Kullback-Leibler Divergence (CKLD), a new metric that captures distributional imbalances in model predictions. Experiments on three LLMs across multiple datasets demonstrate that our methods consistently improve answer accuracy while reducing selection bias, providing a robust solution for both open- and closed-source models.

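Summary

Large language models (LLMs) often show systematic preferences for certain answer choices when responding to multiple-choice questions, a behavior known as selection bias. This bias reduces the accuracy and reliability of LLM outputs and limits their usefulness in decision-critical applications. Whereas prior work focused on adjusting model inputs or outputs to mitigate the issue, our work takes a fundamentally different approach: identifying and removing the internal sources of bias. We introduce two methods: Bias Node Pruning (BNP), which prunes the parameters that contribute to selection bias, and Auxiliary Option Injection (AOI), which adds an extra answer choice to reduce bias in both white-box and black-box settings. To address the shortcomings of existing evaluation metrics, we propose Choice Kullback-Leibler Divergence (CKLD), a new metric that captures distributional imbalances in model predictions. Experiments on three LLMs across multiple datasets show that our methods consistently improve answer accuracy while reducing selection bias, providing a robust solution for both open- and closed-source models.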
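
To make the idea of a distributional metric concrete, here is a minimal sketch of a KL-divergence score over answer choices. It assumes CKLD compares the distribution of gold answer labels with the distribution of the model's predicted labels; the abstract does not give the formula, so this definition, the function names, and the `eps` smoothing are illustrative assumptions rather than the paper's exact metric.

```python
import math
from collections import Counter


def choice_distribution(labels, options, eps=1e-9):
    """Relative frequency of each option in labels, floored at eps (assumed smoothing)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return [max(counts[option] / len(labels), eps) for option in options]


def ckld(gold, predicted, options):
    """KL(gold || predicted) over answer choices; 0 when the two distributions match."""
    p = choice_distribution(gold, options)
    q = choice_distribution(predicted, options)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    options = ["A", "B", "C", "D"]
    gold = ["A", "B", "C", "D"] * 5                 # balanced answer key
    biased = ["A"] * 14 + ["B", "C", "D"] * 2       # model over-selects "A"
    print(ckld(gold, gold, options), ckld(gold, biased, options))
```

A score of this kind depends only on how often each option is chosen, so it can flag a model that favors one letter even when its accuracy looks acceptable.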

Abstract

arXiv:2406.10504v2 Announce Type: replace Abstract: Given a task in the form of a basic description and its training examples, prompt optimization is the problem of synthesizing the given information into a text prompt for a large language model. Humans solve this problem by also considering the different facets that define a task (e.g., counter-examples, explanations, analogies) and including them in the prompt. However, it is unclear whether existing algorithmic approaches, based on iteratively editing a given prompt or automatically selecting a few in-context examples, can cover the multiple facets required to solve a complex task. In this work, we view prompt optimization as that of learning multiple facets of a task from a set of training examples. We exploit structure in the prompt optimization problem and break down a prompt into loosely coupled semantic sections. The proposed algorithm, UniPrompt, (1) clusters the input space and uses clustered batches so that each batch likely corresponds to a different facet of the task, and (2) utilizes a feedback mechanism to propose adding, editing or deleting a section, which in turn is aggregated over a batch to capture generalizable facets. Empirical evaluation on multiple datasets and a real-world task shows that prompts generated using UniPrompt obtain higher accuracy than human-tuned prompts and those from state-of-the-art methods. In particular, our algorithm can generate long, complex prompts that existing methods are unable to generate. Code for UniPrompt is available at https://aka.ms/uniprompt.

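Summary

Given a task consisting of a basic description and training examples, prompt optimization is the problem of synthesizing that information into a text prompt for a large language model. Humans solve this problem by also considering the different facets that define a task (e.g., counter-examples, explanations, analogies) and including them in the prompt. It is unclear, however, whether existing algorithmic approaches, which iteratively edit a given prompt or automatically select a few in-context examples, can cover the multiple facets needed to solve a complex task. This work treats prompt optimization as learning multiple facets of a task from a set of training examples. We exploit structure in the prompt optimization problem and break a prompt into loosely coupled semantic sections. The proposed algorithm, UniPrompt, (1) clusters the input space and uses clustered batches so that each batch likely corresponds to a different facet of the task, and (2) uses a feedback mechanism to propose adding, editing, or deleting a section, aggregating these proposals over a batch to capture generalizable facets. Experiments on multiple datasets and a real-world task show that prompts generated by the algorithm achieve higher accuracy than human-tuned prompts and those from state-of-the-art methods. Notably, the algorithm can generate long, complex prompts that existing methods cannot. Code for UniPrompt is available at https://aka.ms/uniprompt.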

LLMScan: Causal Scan for LLM Misbehavior Detection

Abstract

arXiv:2410.16638v3 Announce Type: replace Abstract: Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's 'brain' behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.

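Summary

Despite the success of Large Language Models (LLMs) across many fields, their potential to generate untruthful, biased, and harmful responses poses significant risks, especially in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. Existing work mostly targets specific issues such as harmful responses; this work instead introduces LLMScan, an innovative monitoring technique based on causality analysis that offers a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, on the premise that the LLM's 'brain' behaves differently when it misbehaves. By analyzing the causal contributions of input tokens and transformer layers, LLMScan effectively detects misbehavior. Experiments across a range of tasks and models reveal clear differences between the causal distributions of normal behavior and misbehavior, which enables accurate, lightweight detectors for a variety of misbehavior detection tasks.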

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models

Abstract

arXiv:2408.12798v2 Announce Type: replace Abstract: Generative large language models (LLMs) have achieved state-of-the-art results on a wide range of tasks, yet they remain susceptible to backdoor attacks: carefully crafted triggers in the input can manipulate the model to produce adversary-specified outputs. While prior research has predominantly focused on backdoor risks in vision and classification settings, the vulnerability of LLMs in open-ended text generation remains underexplored. To fill this gap, we introduce BackdoorLLM, the first comprehensive benchmark for systematically evaluating backdoor threats in text-generation LLMs. (Our BackdoorLLM benchmark was awarded First Prize in the SafetyBench competition, https://www.mlsafety.org/safebench/winners, organized by the Center for AI Safety, https://safe.ai/.) BackdoorLLM provides: (i) a unified repository of benchmarks with a standardized training and evaluation pipeline; (ii) a diverse suite of attack modalities, including data poisoning, weight poisoning, hidden-state manipulation, and chain-of-thought hijacking; (iii) over 200 experiments spanning 8 distinct attack strategies, 7 real-world scenarios, and 6 model architectures; (iv) key insights into the factors that govern backdoor effectiveness and failure modes in LLMs; and (v) a defense toolkit encompassing 7 representative mitigation techniques. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM. We will continuously incorporate emerging attack and defense methodologies to support the research in advancing the safety and reliability of LLMs.

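Summary

Generative large language models (LLMs) have achieved state-of-the-art results on a wide range of tasks, yet they remain susceptible to backdoor attacks: carefully crafted triggers in the input can manipulate the model into producing attacker-specified outputs. While prior research has focused mainly on backdoor risks in vision and classification settings, the vulnerability of LLMs in open-ended text generation remains underexplored. To fill this gap, we introduce BackdoorLLM, the first comprehensive benchmark for systematically evaluating backdoor threats in text-generation LLMs. (The benchmark was awarded First Prize in the SafetyBench competition organized by the Center for AI Safety, https://www.mlsafety.org/safebench/winners.) BackdoorLLM provides: (i) a unified repository of benchmarks with a standardized training and evaluation pipeline; (ii) a diverse suite of attack modalities, including data poisoning, weight poisoning, hidden-state manipulation, and chain-of-thought hijacking; (iii) over 200 experiments spanning 8 attack strategies, 7 real-world scenarios, and 6 model architectures; (iv) key insights into the factors that govern backdoor effectiveness and failure modes in LLMs; and (v) a defense toolkit of 7 representative mitigation techniques. Code and datasets are available at https://github.com/bboylyg/BackdoorLLM. We will continue to incorporate emerging attack and defense methods to support research on the safety and reliability of LLMs.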

CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

Abstract

arXiv:2408.13001v2 Announce Type: replace Abstract: Code benchmarks such as HumanEval are widely adopted to evaluate Large Language Models' (LLMs) coding capabilities. However, there is an unignorable programming language bias in existing code benchmarks -- over 95% code generation benchmarks are dominated by Python, leaving the LLMs' capabilities in other programming languages such as Java and C/C++ unknown. Moreover, coding task bias is also crucial. Most benchmarks focus on code generation capability, while benchmarks for code reasoning (given input, reasoning output; and given output, reasoning input), an essential coding capability, are insufficient. Yet, constructing multi-lingual benchmarks can be expensive and labor-intensive, and codes in contest websites such as Leetcode suffer from data contamination during training. To fill this gap, we propose CRUXEVAL-X, a multi-lingual code reasoning benchmark that contains 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. In particular, the construction pipeline of CRUXEVAL-X works in a fully automated and test-guided manner, which iteratively generates and repairs based on execution feedback. Also, to cross language barriers (e.g., dynamic/static type systems in Python/C++), we formulated various transition rules between language pairs to facilitate translation. Our intensive evaluation of 24 representative LLMs reveals the correlation between language pairs. For example, TypeScript and JavaScript show a significant positive correlation, while Racket has less correlation with other languages. More interestingly, even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages, revealing the cross-language generalization of LLMs.

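Summary

Code benchmarks such as HumanEval are widely used to evaluate the coding capabilities of Large Language Models (LLMs). However, existing benchmarks carry a programming language bias that cannot be ignored: over 95% of code generation benchmarks are dominated by Python, which leaves LLM capabilities in other languages such as Java and C/C++ unclear. Coding task bias matters as well: most benchmarks focus on code generation, while benchmarks for code reasoning (inferring the output from a given input, or the input from a given output) are badly lacking. Building multilingual benchmarks is usually expensive, and code from contest websites such as Leetcode suffers from training data contamination. To fill this gap, we propose CRUXEVAL-X, a multilingual code reasoning benchmark covering 19 programming languages, with at least 600 subjects per language and 19K content-consistent tests in total. The benchmark is built with a fully automated, test-guided pipeline that iteratively generates and repairs code based on execution feedback. To bridge differences between languages (e.g., the dynamic and static type systems of Python and C++), we also define transition rules between language pairs to aid translation. An intensive evaluation of 24 representative LLMs reveals correlations between language pairs: for example, TypeScript and JavaScript show a significant positive correlation, while Racket correlates less with other languages. More interestingly, even a model trained solely on Python can reach up to 34.4% Pass@1 in other languages, which reveals the cross-language generalization of LLMs.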
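
For readers unfamiliar with the Pass@1 figure quoted above, the sketch below shows the unbiased pass@k estimator introduced with HumanEval: with `n` samples per problem of which `c` are correct, pass@k is 1 - C(n-c, k) / C(n, k), which reduces to c/n when k = 1. Whether CRUXEval-X computes its Pass@1 in exactly this way is an assumption; the abstract does not say.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k given n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-problem results: (samples drawn, samples correct).
    print(mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```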

Superhuman performance of a large language model on the reasoning tasks of a physician

Abstract

arXiv:2412.10849v2 Announce Type: replace Abstract: A seminal paper published by Ledley and Lusted in 1959 introduced complex clinical diagnostic reasoning cases as the gold standard for the evaluation of expert medical computing systems, a standard that has held ever since. Here, we report the results of a physician evaluation of a large language model (LLM) on challenging clinical cases against a baseline of hundreds of physicians. We conduct five experiments to measure clinical reasoning across differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, all adjudicated by physician experts with validated psychometrics. We then report a real-world study comparing human expert and AI second opinions in randomly-selected patients in the emergency room of a major tertiary academic medical center in Boston, MA. We compared LLMs and board-certified physicians at three predefined diagnostic touchpoints: triage in the emergency room, initial evaluation by a physician, and admission to the hospital or intensive care unit. In all experiments--both vignettes and emergency room second opinions--the LLM displayed superhuman diagnostic and reasoning abilities, as well as continued improvement from prior generations of AI clinical decision support. Our study suggests that LLMs have achieved superhuman performance on general medical diagnostic and management reasoning, fulfilling the vision put forth by Ledley and Lusted, and motivating the urgent need for prospective trials.

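Summary

A seminal paper published by Ledley and Lusted in 1959 proposed complex clinical diagnostic reasoning cases as the gold standard for evaluating expert medical computing systems, a standard that has held ever since. Here we report a physician evaluation of a large language model (LLM) on challenging clinical cases against a baseline of hundreds of physicians. We run five experiments measuring clinical reasoning across differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, all adjudicated by physician experts using validated psychometrics. We then report a real-world study comparing human expert and AI second opinions for randomly selected patients in the emergency room of a major tertiary academic medical center in Boston. We compare LLMs and board-certified physicians at three predefined diagnostic touchpoints: triage in the emergency room, initial evaluation by a physician, and admission to the hospital or intensive care unit. In all experiments, both vignettes and emergency room second opinions, the LLM displayed superhuman diagnostic and reasoning abilities, along with continued improvement over earlier generations of AI clinical decision support. The study suggests that LLMs have achieved superhuman performance on general medical diagnostic and management reasoning, fulfilling the vision put forth by Ledley and Lusted and motivating the urgent need for prospective trials.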

A Pilot Empirical Study on When and How to Use Knowledge Graphs as Retrieval Augmented Generation

Abstract

arXiv:2502.20854v3 Announce Type: replace Abstract: The integration of Knowledge Graphs (KGs) into the Retrieval Augmented Generation (RAG) framework has attracted significant interest, with early studies showing promise in mitigating hallucinations and improving model accuracy. However, a systematic understanding and comparative analysis of the rapidly emerging KG-RAG methods are still lacking. This paper seeks to lay the foundation for systematically answering the question of when and how to use KG-RAG by analyzing their performance in various application scenarios associated with different technical configurations. After outlining the mind map using KG-RAG framework and summarizing its popular pipeline, we conduct a pilot empirical study of KG-RAG works to reimplement and evaluate 6 KG-RAG methods across 9 datasets in diverse domains and scenarios, analyzing the impact of 9 KG-RAG configurations in combination with 17 LLMs, and combining Metacognition with KG-RAG as a pilot attempt. Our results underscore the critical role of appropriate application conditions and optimal configurations of KG-RAG components.

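Summary

Integrating Knowledge Graphs (KGs) into the Retrieval Augmented Generation (RAG) framework has attracted significant interest, with early studies showing promise in mitigating hallucinations and improving model accuracy. However, a systematic understanding and comparative analysis of the rapidly emerging KG-RAG methods is still lacking. This paper aims to lay the foundation for answering the question of when and how to use KG-RAG by analyzing how these methods perform across application scenarios under different technical configurations. After outlining a mind map of the KG-RAG framework and summarizing its popular pipeline, we conduct a pilot empirical study that reimplements and evaluates 6 KG-RAG methods across 9 datasets in diverse domains and scenarios, analyzes the impact of 9 KG-RAG configurations in combination with 17 LLMs, and, as a pilot attempt, combines metacognition with KG-RAG. The results underscore the critical role of appropriate application conditions and optimal configurations of KG-RAG components.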

Abstract

arXiv:2502.11799v2 Announce Type: replace Abstract: Despite the remarkable capabilities of large language models (LLMs) in various reasoning tasks, they still struggle with table reasoning tasks, particularly in maintaining consistency throughout multi-step reasoning processes. While existing approaches have explored various decomposition strategies, they often lack effective mechanisms to identify and correct errors in intermediate reasoning steps, leading to cascading error propagation. To address these issues, we propose Table-Critic, a novel multi-agent framework that facilitates collaborative criticism and iterative refinement of the reasoning process until convergence to correct solutions. Our framework consists of four specialized agents: a Judge for error identification, a Critic for comprehensive critiques, a Refiner for process improvement, and a Curator for pattern distillation. To effectively deal with diverse and unpredictable error types, we introduce a self-evolving template tree that systematically accumulates critique knowledge through experience-driven learning and guides future reflections. Extensive experiments have demonstrated that Table-Critic achieves substantial improvements over existing methods, achieving superior accuracy and error correction rates while maintaining computational efficiency and lower solution degradation rate.

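Summary

Despite the remarkable capabilities of large language models (LLMs) on a range of reasoning tasks, they still struggle with table reasoning, particularly with maintaining consistency across multi-step reasoning. Existing approaches have explored various decomposition strategies, but they often lack effective mechanisms to identify and correct errors in intermediate reasoning steps, which leads to cascading error propagation. To address these issues, we propose Table-Critic, a novel multi-agent framework that uses collaborative criticism and iterative refinement to drive the reasoning process until it converges to a correct solution. The framework consists of four specialized agents: a Judge for error identification, a Critic for comprehensive critiques, a Refiner for process improvement, and a Curator for pattern distillation. To handle diverse and unpredictable error types, we introduce a self-evolving template tree that accumulates critique knowledge through experience-driven learning and guides future reflections. Extensive experiments show that Table-Critic improves substantially on existing methods, achieving higher accuracy and error correction rates while maintaining computational efficiency and a lower solution degradation rate.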

ARS: Automatic Routing Solver with Large Language Models

Abstract

arXiv:2502.15359v3 Announce Type: replace Abstract: Real-world Vehicle Routing Problems (VRPs) are characterized by a variety of practical constraints, making manual solver design both knowledge-intensive and time-consuming. Although there is increasing interest in automating the design of routing algorithms, existing research has explored only a limited array of VRP variants and fails to adequately address the complex and prevalent constraints encountered in real-world situations. To fill this gap, this paper introduces RoutBench, a benchmark of 1,000 VRP variants derived from 24 attributes, for evaluating the effectiveness of automatic routing solvers in addressing complex constraints. Along with RoutBench, we present the Automatic Routing Solver (ARS), which employs Large Language Model (LLM) agents to enhance a backbone algorithm framework by automatically generating constraint-aware heuristic code, based on problem descriptions and several representative constraints selected from a database. Our experiments show that ARS outperforms state-of-the-art LLM-based methods and commonly used solvers, automatically solving 91.67% of common VRPs and achieving at least a 30% improvement across all benchmarks.

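Summary

Real-world Vehicle Routing Problems (VRPs) involve a variety of practical constraints, which makes manual solver design both knowledge-intensive and time-consuming. Although interest in automating the design of routing algorithms is growing, existing research has explored only a limited set of VRP variants and does not adequately address the complex, prevalent constraints found in real-world situations. To fill this gap, this paper introduces RoutBench, a benchmark of 1,000 VRP variants derived from 24 attributes, for evaluating how well automatic routing solvers handle complex constraints. Alongside RoutBench, we present the Automatic Routing Solver (ARS), which uses Large Language Model (LLM) agents to enhance a backbone algorithm framework by automatically generating constraint-aware heuristic code from problem descriptions and several representative constraints selected from a database. Experiments show that ARS outperforms state-of-the-art LLM-based methods and commonly used solvers, automatically solving 91.67% of common VRPs and achieving at least a 30% improvement across all benchmarks.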

FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference

Abstract

arXiv:2502.15804v2 Announce Type: replace Abstract: KV cache techniques in Transformer models aim to reduce redundant computations at the expense of substantially increased memory usage, making KV cache compression an important and popular research topic. Recently, state-of-the-art KV cache compression methods implement imbalanced, per-head allocation algorithms that dynamically adjust the KV cache budget for each attention head, achieving excellent performance in single-GPU scenarios. However, we observe that such imbalanced compression leads to significant load imbalance when deploying multi-GPU inference, as some GPUs become overburdened while others remain underutilized. In this paper, we propose FairKV, a method designed to ensure fair memory usage among attention heads in systems employing imbalanced KV cache compression. The core technique of FairKV is Fair-Copying, which replicates a small subset of memory-intensive attention heads across GPUs using data parallelism to mitigate load imbalance. Our experiments on popular models, including LLaMA 70b and Mistral 24b model, demonstrate that FairKV increases throughput by 1.66x compared to standard tensor parallelism inference. Our code will be released as open source upon acceptance.

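Summary

KV cache techniques in Transformer models reduce redundant computation at the cost of substantially higher memory usage, which has made KV cache compression an important and popular research topic. Recent state-of-the-art KV cache compression methods use imbalanced, per-head allocation algorithms that dynamically adjust the KV cache budget of each attention head and perform very well in single-GPU scenarios. However, we observe that such imbalanced compression causes significant load imbalance in multi-GPU inference: some GPUs become overburdened while others remain underutilized. This paper proposes FairKV, a method designed to ensure fair memory usage among attention heads in systems that employ imbalanced KV cache compression. The core technique of FairKV is Fair-Copying, which uses data parallelism to replicate a small subset of memory-intensive attention heads across GPUs and thereby mitigate load imbalance. Experiments on popular models, including LLaMA 70b and Mistral 24b, show that FairKV increases throughput by 1.66x compared with standard tensor parallelism inference. The code will be released as open source upon acceptance.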
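
The load-imbalance problem and the effect of replicating heavy heads can be illustrated with a toy placement model. This is a minimal sketch under stated assumptions, not the FairKV algorithm: it assumes each head has a known KV cache budget, places heads greedily on the least-loaded GPU, and models replication by copying the heaviest heads to every GPU with each copy carrying an equal share of that head's load. The names `assign_heads` and `imbalance` are hypothetical.

```python
def assign_heads(head_loads, num_gpus, replicate_top=0):
    """Greedy longest-first placement of attention heads; returns per-GPU load.

    head_loads maps a head id to its KV cache budget. The replicate_top heaviest
    heads are copied to every GPU, each copy carrying 1/num_gpus of the load
    (an assumed simplification of data-parallel replication).
    """
    loads = [0.0] * num_gpus
    ordered = sorted(head_loads.items(), key=lambda item: item[1], reverse=True)
    for rank, (_head, load) in enumerate(ordered):
        if rank < replicate_top:
            for gpu in range(num_gpus):
                loads[gpu] += load / num_gpus
        else:
            # Place the head on the currently least-loaded GPU.
            target = min(range(num_gpus), key=loads.__getitem__)
            loads[target] += load
    return loads


def imbalance(loads):
    """Ratio of the busiest GPU's load to the mean load (1.0 is perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    heads = {"h0": 10, "h1": 2, "h2": 2, "h3": 2}   # one memory-heavy head
    print(imbalance(assign_heads(heads, 2)))                    # no replication
    print(imbalance(assign_heads(heads, 2, replicate_top=1)))   # heaviest head replicated
```

In this toy case a single oversized head sets a floor on the busiest GPU's load that no placement can lower, while splitting that head's work across GPUs does lower it. The sketch ignores the extra memory that replicas themselves consume.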

The Hidden Strength of Disagreement: Unraveling the Consensus-Diversity Tradeoff in Adaptive Multi-Agent Systems

Abstract

arXiv:2502.16565v2 Announce Type: replace Abstract: Consensus formation is pivotal in multi-agent systems (MAS), balancing collective coherence with individual diversity. Conventional LLM-based MAS primarily rely on explicit coordination, e.g., prompts or voting, risking premature homogenization. We argue that implicit consensus, where agents exchange information yet independently form decisions via in-context learning, can be more effective in dynamic environments that require long-horizon adaptability. By retaining partial diversity, systems can better explore novel strategies and cope with external shocks. We formalize a consensus-diversity tradeoff, showing conditions where implicit methods outperform explicit ones. Experiments on three scenarios -- Dynamic Disaster Response, Information Spread and Manipulation, and Dynamic Public-Goods Provision -- confirm partial deviation from group norms boosts exploration, robustness, and performance. We highlight emergent coordination via in-context learning, underscoring the value of preserving diversity for resilient decision-making.

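Summary

Consensus formation is pivotal in multi-agent systems (MAS), where collective coherence must be balanced against individual diversity. Conventional LLM-based MAS rely mainly on explicit coordination, such as prompts or voting, and so risk premature homogenization. We argue that implicit consensus, in which agents exchange information yet form their decisions independently via in-context learning, can be more effective in dynamic environments that require long-horizon adaptability. By retaining partial diversity, systems can better explore novel strategies and cope with external shocks. We formalize a consensus-diversity tradeoff and show the conditions under which implicit methods outperform explicit ones. Experiments on three scenarios (Dynamic Disaster Response, Information Spread and Manipulation, and Dynamic Public-Goods Provision) confirm that partial deviation from group norms boosts exploration, robustness, and performance. We highlight the coordination that emerges through in-context learning, which underscores the value of preserving diversity for resilient decision-making.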

KunServe: Parameter-centric Memory Management for Efficient Memory Throttling Handling in LLM Serving

Abstract

arXiv:2412.18169v3 Announce Type: replace Abstract: Serving LLMs with a cluster of GPUs is common nowadays, where the serving system must meet strict latency SLOs required by applications. However, the stateful nature of LLM serving requires maintaining huge states (i.e., KVCache) in limited GPU memory. Under spikes in real-world workloads, GPU memory can be easily throttled, leading to orders of magnitude higher response latency due to queuing introduced by waiting for KVCache to be reclaimed. Prior KVCache-centric approaches handle load throttling by dropping, migrating, or swapping KVCache. These methods fail to release sufficient memory quickly with requests still queued. This paper proposes the first parameter-centric approach to handling throttling by selectively dropping replicated parameters to instantly free memory for requests, based on an unnoticed observation that model parameters are commonly replicated across GPUs for serving LLMs. With additional memory, all requests can be served with a larger batch without queuing. To make the parameter-centric approach correct and efficient, we cooperatively execute requests on GPUs with a complete copy of parameters using pipeline parallelism, and derive an appropriate drop plan without unnecessary cooperation. We also design techniques to minimize the performance overhead due to pipeline parallelism with the execution patterns of requests under drop. Evaluations show that KunServe reduces the tail TTFT of requests under throttling by up to 72.2 times compared to the state-of-the-art systems including Llumnix, vLLM and InferCept.

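Summary

Serving LLMs on a cluster of GPUs is now common, and the serving system must meet the strict latency SLOs that applications require. However, the stateful nature of LLM serving means that huge states (i.e., the KVCache) must be kept in limited GPU memory. Under spikes in real-world workloads, GPU memory is easily throttled, and the queuing caused by waiting for KVCache to be reclaimed raises response latency by orders of magnitude. Prior KVCache-centric approaches handle throttling by dropping, migrating, or swapping KVCache, but they cannot release enough memory quickly while requests are still queued. This paper proposes the first parameter-centric approach to throttling: based on the overlooked observation that model parameters are commonly replicated across GPUs when serving LLMs, it selectively drops replicated parameters to free memory for requests immediately. With the additional memory, all requests can be served in a larger batch without queuing. To make the parameter-centric approach correct and efficient, we use pipeline parallelism to execute requests cooperatively across GPUs that together hold a complete copy of the parameters, and derive a drop plan that avoids unnecessary cooperation. We also design techniques that minimize the performance overhead of pipeline parallelism given the execution patterns of requests after a drop. Evaluations show that the system reduces the tail time to first token (TTFT) of requests under throttling by up to 72.2 times compared with state-of-the-art systems including Llumnix, vLLM, and InferCept.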

AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

Abstract

arXiv:2503.09780v2 Announce Type: replace Abstract: Autonomous AI agents that can follow instructions and perform complex multi-step tasks have tremendous potential to boost human productivity. However, to perform many of these tasks, the agents need access to personal information from their users, raising the question of whether they are capable of using it appropriately. In this work, we introduce a new benchmark AgentDAM that measures if AI web-navigation agents follow the privacy principle of "data minimization". For the purposes of our benchmark, data minimization means that the agent uses a piece of potentially sensitive information only if it is "necessary" to complete a particular task. Our benchmark simulates realistic web interaction scenarios end-to-end and is adaptable to all existing web navigation agents. We use AgentDAM to evaluate how well AI agents built on top of GPT-4, Llama-3 and Claude can limit processing of potentially private information, and show that they are prone to inadvertent use of unnecessary sensitive information. We also propose a prompting-based defense that reduces information leakage, and demonstrate that our end-to-end benchmarking provides a more realistic measure than probing LLMs about privacy. Our results highlight that further research is needed to develop AI agents that can prioritize data minimization at inference time.

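Summary

Autonomous AI agents that can follow instructions and perform complex multi-step tasks have tremendous potential to boost human productivity. To perform many of these tasks, however, the agents need access to their users' personal information, which raises the question of whether they can use it appropriately. This work introduces AgentDAM, a new benchmark that measures whether AI web-navigation agents follow the privacy principle of "data minimization". For the purposes of the benchmark, data minimization means that an agent uses a piece of potentially sensitive information only if it is "necessary" to complete a particular task. The benchmark simulates realistic web interaction scenarios end-to-end and can be adapted to all existing web navigation agents. We use AgentDAM to evaluate how well AI agents built on GPT-4, Llama-3, and Claude limit their processing of potentially private information, and find that they are prone to inadvertent use of unnecessary sensitive information. We also propose a prompting-based defense that reduces information leakage, and show that end-to-end benchmarking gives a more realistic measure than probing LLMs about privacy. The results highlight the need for further research on AI agents that can prioritize data minimization at inference time.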

Beyond Single Pass, Looping Through Time: KG-IRAG with Iterative Knowledge Retrieval

Abstract

arXiv:2503.14234v3 Announce Type: replace Abstract: Graph Retrieval-Augmented Generation (GraphRAG) has proven highly effective in enhancing the performance of Large Language Models (LLMs) on tasks that require external knowledge. By leveraging Knowledge Graphs (KGs), GraphRAG improves information retrieval for complex reasoning tasks, providing more precise and comprehensive retrieval and generating more accurate responses to QAs. However, most RAG methods fall short in addressing multi-step reasoning, particularly when both information extraction and inference are necessary. To address this limitation, this paper presents Knowledge Graph-Based Iterative Retrieval-Augmented Generation (KG-IRAG), a novel framework that integrates KGs with iterative reasoning to improve LLMs' ability to handle queries involving temporal and logical dependencies. Through iterative retrieval steps, KG-IRAG incrementally gathers relevant data from external KGs, enabling step-by-step reasoning. The proposed approach is particularly suited for scenarios where reasoning is required alongside dynamic temporal data extraction, such as determining optimal travel times based on weather conditions or traffic patterns. Experimental results show that KG-IRAG improves accuracy in complex reasoning tasks by effectively integrating external knowledge with iterative, logic-based retrieval. Additionally, three new datasets: weatherQA-Irish, weatherQA-Sydney, and trafficQA-TFNSW, are formed to evaluate KG-IRAG's performance, demonstrating its potential beyond traditional RAG applications.

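Summary

Graph Retrieval-Augmented Generation (GraphRAG) has proven highly effective at improving the performance of Large Language Models (LLMs) on tasks that require external knowledge. By using Knowledge Graphs (KGs), GraphRAG improves information retrieval for complex reasoning tasks, providing more precise and comprehensive retrieval and more accurate answers to questions. However, most RAG methods fall short on multi-step reasoning, particularly when both information extraction and inference are required. To address this limitation, this paper presents Knowledge Graph-Based Iterative Retrieval-Augmented Generation (KG-IRAG), a novel framework that combines KGs with iterative reasoning to improve LLMs' ability to handle queries involving temporal and logical dependencies. Through iterative retrieval steps, KG-IRAG incrementally gathers relevant data from external KGs, enabling step-by-step reasoning. The approach is particularly suited to scenarios where reasoning must be combined with dynamic temporal data extraction, such as determining the best travel time based on weather conditions or traffic patterns. Experimental results show that KG-IRAG improves accuracy on complex reasoning tasks by effectively integrating external knowledge with iterative, logic-based retrieval. In addition, three new datasets (weatherQA-Irish, weatherQA-Sydney, and trafficQA-TFNSW) are constructed to evaluate KG-IRAG's performance, demonstrating its potential beyond traditional RAG applications.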
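
The retrieve-then-reason loop described above can be sketched on the travel-time example. This is an illustrative toy, not the KG-IRAG implementation: the knowledge graph is assumed to be a plain dictionary mapping an hour to whether it rains, the reasoning step is a hand-written rule rather than an LLM call, and the name `iterative_retrieve` is hypothetical.

```python
def iterative_retrieve(kg, start_hour, trip_hours, max_steps=12):
    """Find the earliest departure hour with trip_hours consecutive dry hours.

    kg maps hour -> True if it rains in that hour. Each iteration retrieves
    only the facts needed for the current candidate, then reasons over them.
    Returns (departure_hour or None, evidence retrieved so far).
    """
    evidence = {}
    candidate = start_hour
    for _ in range(max_steps):
        window = range(candidate, candidate + trip_hours)
        # Retrieval step: fetch only the hours not yet in the evidence set.
        for hour in window:
            if hour not in evidence:
                if hour not in kg:
                    return None, evidence   # the graph has no data this far out
                evidence[hour] = kg[hour]
        # Reasoning step: accept the candidate or move past the last rainy hour.
        rainy = [hour for hour in window if evidence[hour]]
        if not rainy:
            return candidate, evidence
        candidate = max(rainy) + 1
    return None, evidence


if __name__ == "__main__":
    # Hypothetical hourly rain facts.
    weather_kg = {9: True, 10: False, 11: True, 12: False, 13: False, 14: False}
    print(iterative_retrieve(weather_kg, start_hour=9, trip_hours=2))
```

Because each round fetches only what the current hypothesis needs, the loop stops as soon as the evidence is sufficient; in the toy data above, hour 14 is never retrieved.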

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Abstract

arXiv:2503.15558v3 Announce Type: replace Abstract: Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.

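Summary

Physical AI systems need to perceive, understand, and perform complex actions in the physical world. This paper presents the Cosmos-Reason1 models, which can understand the physical world and generate appropriate embodied decisions (e.g., the next action) in natural language through long chain-of-thought reasoning. We begin by defining the key capabilities for Physical AI reasoning, focusing on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train the models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate the models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. The evaluation results show that Physical AI SFT and RL bring significant improvements. To support the development of Physical AI, we release our code and pre-trained models under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.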

A Self-Improving Coding Agent

Abstract

arXiv:2504.15228v2 Announce Type: replace Abstract: Recent advancements in Large Language Models (LLMs) have spurred interest in deploying LLM agents to undertake tasks in the world. LLMs are often deployed in agent systems: code that orchestrates LLM calls and provides them with tools. We demonstrate that an agent system, equipped with basic coding tools, can autonomously edit itself, and thereby improve its performance on benchmark tasks. We find that performance improves from 17% to 53% on a random subset of SWE Bench Verified, with additional performance gains on LiveCodeBench, as well as on synthetically generated agent benchmarks. Our work represents an advancement in the automated and open-ended design of agentic systems, and demonstrates a data-efficient, non-gradient-based learning mechanism driven by LLM reflection and code updates.

摘要

大型语言模型（LLMs）的最新进展引发了人们对部署LLM代理执行现实世界任务的兴趣。LLMs通常被部署在代理系统中：这类系统通过代码协调LLM调用并为其提供工具。我们证明，配备基本编码工具的代理系统能够自主修改自身代码，从而提升其在基准任务上的性能。在SWE Bench Verified随机子集上，我们观察到性能从17%提升至53%，在LiveCodeBench以及合成生成的代理基准测试中亦获得额外性能提升。本研究代表了自主开放式代理系统设计领域的进步，并展示了一种由LLM自我反思和代码更新驱动的高效数据利用、非基于梯度的学习机制。

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

Abstract

arXiv:2504.09702v2 Announce Type: replace Abstract: We introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work, e.g., AI Scientist, which evaluates the end-to-end agentic pipeline by using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, designed to grow with new ML competitions and encourage rigorous, objective evaluations of AI research capabilities. Our leaderboard and code are available at: https://huggingface.co/spaces/launch/MLRC_Bench

摘要

我们推出MLRC-Bench基准测试，旨在量化语言代理解决具有挑战性的机器学习（ML）研究竞赛的能力，重点关注需要新方法的开放研究问题。与先前工作（如AI Scientist）使用LLM作为评判端到端代理流程不同，MLRC-Bench通过严格协议和客观指标，衡量提出与实现新研究方法的关键步骤。我们精选的7项竞赛任务揭示了LLM代理面临的重大挑战：表现最佳的测试代理（MLAB框架下的gemini-exp-1206）仅缩小基线分数与人类顶尖参与者成绩之间9.3%的差距。此外，分析表明LLM评判的创新性与前沿ML研究问题的实际表现存在偏差。MLRC-Bench是一个动态基准，设计为可随新ML竞赛扩展，以促进对AI研究能力的严格客观评估。排行榜与代码详见：https://huggingface.co/spaces/launch/MLRC_Bench
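
The "9.3% of the gap" figure is a relative-improvement measure. The sketch below shows one plausible way to compute such a number; the exact normalisation used by MLRC-Bench is an assumption here, and the function name and scores are hypothetical.

```python
def relative_gap_closed(agent: float, baseline: float, top_human: float) -> float:
    """Percentage of the baseline-to-top-human gap that the agent closes.

    Assumed definition: 100 * (agent - baseline) / (top_human - baseline).
    0 means no better than the baseline, 100 means matching the top human.
    """
    if top_human == baseline:
        raise ValueError("top_human and baseline must differ")
    return 100.0 * (agent - baseline) / (top_human - baseline)


# Hypothetical scores, chosen only to illustrate the arithmetic.
gap = relative_gap_closed(agent=59.3, baseline=50.0, top_human=150.0)
print(f"gap closed: {gap:.1f}%")
```

Under this definition an agent that scores below the baseline gets a negative value, which makes regressions visible rather than clipping them to zero.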

Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws

Abstract

arXiv:2504.09597v5 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales -- from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heaps' and Zipf's laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors of LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, and factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.

摘要

尽管大语言模型（LLMs）在众多任务中展现出卓越能力，但其内在机制及诸如缩放定律、幻觉现象等相关行为的原理性解释仍不明确。本研究基于柯尔莫哥洛夫复杂度与香农信息论，重新审视压缩与预测之间的经典关系，从而深入理解LLMs的行为特性。通过运用柯尔莫哥洛夫结构函数，并将LLMs的压缩过程解释为两阶段编码机制，我们详细揭示了LLMs如何随模型与数据规模增长获取并存储信息——从普遍存在的句法模式到逐渐稀疏的知识要素。受此理论视角及赫普定律、齐普夫定律启发的自然假设驱动，我们提出一个简化但具代表性的分层数据生成框架——'句法-知识'模型。在贝叶斯设定下，我们证明该模型中的预测与压缩过程会自然导致LLMs多样化的学习行为与缩放特性。特别地，理论分析为数据与模型缩放定律、训练与微调过程中的知识获取动态、以及LLMs的事实性知识幻觉现象提供了直观且原理性的解释。实验结果验证了我们的理论预测。

Signatures of human-like processing in Transformer forward passes

Abstract

arXiv:2504.14107v2 Announce Type: replace Abstract: Modern AI models are increasingly being used as theoretical tools to study human cognition. One dominant approach is to evaluate whether human-derived measures are predicted by a model's output: that is, the end-product of a forward pass. However, recent advances in mechanistic interpretability have begun to reveal the internal processes that give rise to model outputs, raising the question of whether models might use human-like processing strategies. Here, we investigate the relationship between real-time processing in humans and layer-time dynamics of computation in Transformers, testing 20 open-source models in 6 domains. We first explore whether forward passes show mechanistic signatures of competitor interference, taking high-level inspiration from cognitive theories. We find that models indeed appear to initially favor a competing incorrect answer in the cases where we would expect decision conflict in humans. We then systematically test whether forward-pass dynamics predict signatures of processing in humans, above and beyond properties of the model's output probability distribution. We find that dynamic measures improve prediction of human processing measures relative to static final-layer measures. Moreover, across our experiments, larger models do not always show more human-like processing patterns. Our work suggests a new way of using AI models to study human cognition: not just as a black box mapping stimuli to responses, but potentially also as explicit processing models.

摘要

现代人工智能模型正日益被用作研究人类认知的理论工具。主流方法之一是评估模型输出（即前向传播的最终结果）是否能预测人类行为指标。然而，机制可解释性领域的最新进展逐渐揭示了模型产生输出的内部过程，这引发了一个新问题：模型是否可能采用类人的处理策略？本研究探究了人类实时处理与Transformer模型层级计算动态之间的关系，在6个领域测试了20个开源模型。我们首先基于认知理论的高层次启发，探索前向传播是否展现竞争干扰的机制特征。结果发现，在人类预期会出现决策冲突的情况下，模型确实会先偏向错误的竞争答案。随后我们系统检验了前向传播动态能否预测人类处理特征——其预测能力是否超越模型输出概率分布的静态特性。研究表明，动态测量指标相较于静态最终层指标能更好地预测人类处理行为。值得注意的是，在所有实验中，更大规模的模型并不总是表现出更接近人类的处理模式。这项工作提出了一种利用AI模型研究人类认知的新范式：不仅将其视为从刺激到响应的黑箱映射，还可能作为显式的处理过程模型。

GVPO: Group Variance Policy Optimization for Large Language Model Post-Training

Abstract

arXiv:2504.19599v2 Announce Type: replace Abstract: Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO), leverage increased sampling with relative reward scoring to achieve superior performance, these methods often suffer from training instability that limits their practical adoption. To address this challenge, we present Group Variance Policy Optimization (GVPO). GVPO incorporates the analytical solution to KL-constrained reward maximization directly into its gradient weights, ensuring alignment with the optimal policy. The method has an intuitive physical interpretation: its gradient mirrors the mean squared error between the central distance of implicit rewards and that of actual rewards. GVPO offers two key advantages: (1) it guarantees a unique optimal solution, which is exactly the solution of the KL-constrained reward maximization objective, and (2) it supports flexible sampling distributions that avoid the limitations of on-policy and importance sampling. By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.

摘要

后训练在精调和对齐大语言模型以适应特定任务和人类偏好方面起着关键作用。尽管近期后训练技术（如组相对策略优化GRPO）通过增加采样并结合相对奖励评分实现了优异性能，但这些方法常因训练不稳定性而限制其实际应用。为解决这一问题，我们提出组方差策略优化（GVPO）。该方法将KL约束奖励最大化的解析解直接融入梯度权重，确保与最优策略对齐。其物理意义直观明确：梯度反映了隐式奖励中心距离与实际奖励中心距离的均方误差。GVPO具有两大核心优势：（1）保证存在唯一最优解，严格满足KL约束奖励最大化目标；（2）支持灵活采样分布，规避了同策略与重要性采样的局限性。通过理论保证与实践适应性的统一，GVPO为可靠且多功能的大语言模型后训练建立了新范式。
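
The mean-squared-error reading of the GVPO gradient can be made concrete with a small sketch. It assumes, as is common in KL-constrained formulations, that the implicit reward of a response is beta times the log-ratio between the current policy and a reference policy; the function name, the `beta` value, and this exact loss form are illustrative assumptions, not the authors' implementation.

```python
def gvpo_style_loss(logp_policy, logp_ref, rewards, beta=0.1):
    """MSE between group-centred implicit rewards and group-centred actual rewards.

    All three inputs are per-response lists for ONE prompt (one sampled group).
    Assumption: implicit reward = beta * (log pi_theta - log pi_ref).
    """
    n = len(rewards)
    if n == 0 or len(logp_policy) != n or len(logp_ref) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    implicit = [beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    implicit_mean = sum(implicit) / n
    reward_mean = sum(rewards) / n

    # "Central distance": each value's offset from its own group mean.
    squared_errors = [
        ((imp - implicit_mean) - (rew - reward_mean)) ** 2
        for imp, rew in zip(implicit, rewards)
    ]
    return sum(squared_errors) / n
```

The loss is zero exactly when the implicit rewards match the actual rewards up to a constant shift within the group, which is why only group-centred quantities appear.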

OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training

Abstract

arXiv:2504.09844v2 Announce Type: replace Abstract: Modern frameworks for training large foundation models (LFMs) employ dataloaders in a data-parallel manner, with each loader processing a disjoint subset of training data. Under multisource preprocessing, two fundamental challenges exist. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to significant workload imbalance among dataloaders, degrading the training efficiency. Second, supporting diverse data sources requires per-dataset file access states that are redundantly replicated across parallel loaders, consuming excessive memory. This also hinders dynamic data mixing (e.g., curriculum learning) and causes redundant access/memory overhead in hybrid parallelism. We present Omniload, an industrial-grade distributed data loading architecture for LFMs, with four innovations: (1) Disaggregated data preprocessing via role-specific actors (Source Loaders/Data Constructors) to eliminate source and parallelism redundant data access and ensure multisource scalability. (2) Centralized and declarative data plane for elastic multisource orchestration, such as long-short context, multimodality, and curriculum learning. (3) Multi-level auto-partitioning and scaling mechanism for source loaders under heterogeneous preprocessing costs. (4) Shadow loaders with differential checkpointing for fault recovery without workflow interruption. Deployed on production clusters scaling to multi-thousand GPUs, Omniload achieves: (1) 4.5x end-to-end training throughput improvement, (2) 13.5x reduction in CPU memory usage.

摘要

现代大规模基础模型（LFM）训练框架采用数据并行方式的数据加载器，每个加载器处理训练数据的互斥子集。在多源预处理场景下存在两个根本性挑战：首先，由于注意力算子的二次计算复杂度，数据并行节点间非均匀的样本分布会导致加载器间显著的工作负载不平衡，降低训练效率；其次，支持多样化数据源需要为每个数据集维护文件访问状态，这些状态在并行加载器间冗余复制，消耗过量内存。这同时阻碍了动态数据混合（如课程学习），并在混合并行场景下造成冗余访问/内存开销。我们提出Omniload——一个工业级LFM分布式数据加载架构，包含四项创新：（1）通过角色化执行器（源加载器/数据构造器）实现解耦的数据预处理，消除数据源与并行化带来的冗余访问，确保多源可扩展性；（2）集中式声明性数据平面支持弹性多源编排，如长短上下文、多模态及课程学习；（3）异构预处理成本下的源加载器多级自动分区与扩展机制；（4）采用差异检查点的影子加载器实现无工作流中断的故障恢复。在扩展至数千GPU的生产集群中部署后，Omniload实现：（1）端到端训练吞吐量提升4.5倍；（2）CPU内存使用量降低13.5倍。
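
The first challenge above, workload imbalance caused by the quadratic cost of attention, can be illustrated with a toy balancer. This is not Omniload's mechanism, which the abstract does not specify at this level; it is a standard greedy longest-first heuristic, with the per-sample cost assumed to be the squared sequence length and all names hypothetical.

```python
import heapq


def balance_by_attention_cost(seq_lengths, num_ranks):
    """Assign samples to data-parallel ranks, balancing sum(length ** 2).

    Greedy heuristic: visit samples from most to least expensive and give
    each one to the rank with the smallest accumulated cost so far.
    """
    if num_ranks <= 0:
        raise ValueError("num_ranks must be positive")
    heap = [(0, rank) for rank in range(num_ranks)]  # (accumulated cost, rank)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for length in sorted(seq_lengths, reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(length)
        heapq.heappush(heap, (load + length ** 2, rank))
    return assignment


def rank_costs(assignment):
    """Quadratic attention cost accumulated on each rank."""
    return [sum(length ** 2 for length in rank) for rank in assignment]
```

Balancing on sample counts alone would treat one 8-token sample like one 1-token sample, although under the quadratic assumption the former costs 64 times as much.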

Large Linguistic Models: Investigating LLMs' metalinguistic abilities

Abstract

arXiv:2305.00948v4 Announce Type: replace-cross Abstract: The performance of large language models (LLMs) has recently improved to the point where models can perform well on many language tasks. We show here that--for the first time--the models can also generate valid metalinguistic analyses of language data. We outline a research program where the behavioral interpretability of LLMs on these tasks is tested via prompting. LLMs are trained primarily on text--as such, evaluating their metalinguistic abilities improves our understanding of their general capabilities and sheds new light on theoretical models in linguistics. We show that OpenAI's (2024) o1 vastly outperforms other models on tasks involving drawing syntactic trees and phonological generalization. We speculate that OpenAI o1's unique advantage over other models may result from the model's chain-of-thought mechanism, which mimics the structure of human reasoning used in complex cognitive tasks, such as linguistic analysis.

摘要

大型语言模型（LLMs）的性能近期已提升至能在多项语言任务中表现出色的程度。本文首次证明这些模型还能生成有效的语言数据元语言分析。我们提出一项研究计划，通过提示测试LLMs在此类任务中的行为可解释性。由于LLMs主要基于文本训练，评估其元语言能力不仅能增进对其整体功能的理解，也为语言学理论模型提供了新视角。研究表明，OpenAI（2024）的o1模型在句法树绘制和音系概括任务上显著优于其他模型。我们推测，o1模型的独特优势可能源于其思维链机制——该机制模拟了人类在语言分析等复杂认知任务中的推理结构。

Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey

Abstract

arXiv:2505.01821v2 Announce Type: replace Abstract: Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems, yet introduce significant challenges in model deployment and resource management. In this survey, we comprehensively examine the intersection of distributed intelligence and model optimization within edge-cloud environments, providing a structured tutorial on fundamental architectures, enabling technologies, and emerging applications. Additionally, we systematically analyze model optimization approaches, including compression, adaptation, and neural architecture search, alongside AI-driven resource management strategies that balance performance, energy efficiency, and latency requirements. We further explore critical aspects of privacy protection and security enhancement within ECCC systems and examine practical deployments through diverse applications, spanning autonomous driving, healthcare, and industrial automation. Performance analysis and benchmarking techniques are also thoroughly explored to establish evaluation standards for these complex systems. Furthermore, the review identifies critical research directions including LLM deployment, 6G integration, neuromorphic computing, and quantum computing, offering a roadmap for addressing persistent challenges in heterogeneity management, real-time processing, and scalability. By bridging theoretical advancements and practical deployments, this survey offers researchers and practitioners a holistic perspective on leveraging AI to optimize distributed computing environments, fostering innovation in next-generation intelligent systems.

摘要

边缘-云协同计算（ECCC）作为一种关键范式，通过整合云端资源与边缘设备来满足现代智能应用的计算需求，实现高效低延迟处理。人工智能尤其是深度学习与大语言模型（LLMs）的最新进展显著提升了这些分布式系统的能力，但同时也给模型部署与资源管理带来了重大挑战。本综述系统探讨了边缘-云环境中分布式智能与模型优化的交叉领域，对基础架构、使能技术和新兴应用进行了结构化梳理。我们详细分析了包括模型压缩、自适应优化和神经架构搜索在内的优化方法，以及平衡性能、能效与延迟需求的AI驱动资源管理策略。进一步研究了ECCC系统中隐私保护与安全增强的关键技术，并通过自动驾驶、医疗健康和工业自动化等多样化应用考察实际部署方案。同时深入探讨了性能分析与基准测试技术，为这类复杂系统建立评估标准。此外，本文指出LLMs部署、6G融合、神经形态计算和量子计算等关键研究方向，为解决异构性管理、实时处理和可扩展性等长期挑战提供路线图。通过连接理论进展与实际应用，本综述为研究者与实践者提供了利用AI优化分布式计算环境的整体视角，推动新一代智能系统的创新发展。

AlignRAG: Leveraging Critique Learning for Evidence-Sensitive Retrieval-Augmented Reasoning

Abstract

arXiv:2504.14858v2 Announce Type: replace Abstract: Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). However, standard RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. In this work, we reinterpret RAG as Retrieval-Augmented Reasoning and identify a central but underexplored problem: Reasoning Misalignment, the divergence between an LLM's internal reasoning trajectory and the evidential constraints provided by retrieval. To address this issue, we propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). At the heart of AlignRAG lies a contrastive critique synthesis mechanism that generates retrieval-sensitive critiques while mitigating self-bias. This mechanism trains a dedicated retrieval-augmented Critic Language Model (CLM) using labeled critiques that distinguish between evidence-aligned and misaligned reasoning. Alignment signals for supervision are obtained through self-supervised or externally guided labeling strategies. The resulting CLM is explicitly optimized for evidence sensitivity, enabling it to detect and revise reasoning errors during inference without relying solely on self-generated feedback. Empirical evaluations show that our 8B-parameter CLM improves performance over the Self-Refine baseline by 12.1% on out-of-domain tasks and outperforms a standard 72B-parameter CLM by 2.2%, while remaining compatible with existing RAG architectures as a plug-and-play module. Overall, AlignRAG offers a principled solution for aligning model reasoning with retrieved evidence, substantially improving the factual reliability and robustness of RAG systems.

摘要

检索增强生成（RAG）已成为实现知识驱动大型语言模型（LLM）的广泛采用范式。然而，标准RAG流程往往无法确保模型推理与检索证据保持一致，导致事实不一致或结论缺乏支持。本研究将RAG重新诠释为检索增强推理，并揭示了一个核心但未被充分探索的问题：推理错位，即LLM内部推理轨迹与检索提供的证据约束之间的偏差。为解决该问题，我们提出AlignRAG，这是一个基于批判驱动对齐（CDA）的新型迭代框架。AlignRAG的核心在于对比批判合成机制，该机制在生成检索敏感批判的同时能缓解自我偏差。通过使用区分证据对齐与错位推理的标注批判，该机制训练了专用的检索增强型批判语言模型（CLM）。监督对齐信号通过自监督或外部引导标注策略获得。所构建的CLM被显式优化以具备证据敏感性，使其能在推理过程中检测并修正推理错误，而不完全依赖自我生成的反馈。实证评估表明，我们的80亿参数CLM在域外任务上比Self-Refine基线性能提升12.1%，且以2.2%的优势超越标准720亿参数CLM，同时可作为即插即用模块与现有RAG架构兼容。总体而言，AlignRAG为模型推理与检索证据的对齐提供了原则性解决方案，显著提升了RAG系统的事实可靠性与鲁棒性。

PlanFitting: Personalized Exercise Planning with Large Language Model-driven Conversational Agent

Abstract

arXiv:2309.12555v2 Announce Type: replace-cross Abstract: Creating personalized and actionable exercise plans often requires iteration with experts, which can be costly and inaccessible to many individuals. This work explores the capabilities of Large Language Models (LLMs) in addressing these challenges. We present PlanFitting, an LLM-driven conversational agent that assists users in creating and refining personalized weekly exercise plans. By engaging users in free-form conversations, PlanFitting helps elicit users' goals, availabilities, and potential obstacles, and enables individuals to generate personalized exercise plans aligned with established exercise guidelines. Our study -- involving a user study, intrinsic evaluation, and expert evaluation -- demonstrated PlanFitting's ability to guide users to create tailored, actionable, and evidence-based plans. We discuss future design opportunities for LLM-driven conversational agents to create plans that better comply with exercise principles and accommodate personal constraints.

摘要

制定个性化且可执行的锻炼计划通常需要与专家多次迭代，这对许多人而言成本高昂且难以实现。本研究探讨了大型语言模型（LLMs）在应对这些挑战方面的潜力。我们提出了PlanFitting，一个基于LLM的对话代理，可协助用户创建并优化个性化的每周锻炼计划。通过自由形式的对话交互，PlanFitting帮助获取用户的目标、可用时间及潜在障碍，使个体能够生成符合既定锻炼指南的个性化方案。我们的研究——包括用户实验、内在评估和专家评估——证明了PlanFitting能有效引导用户制定量身定制、可操作且基于实证的计划。最后，我们讨论了LLM驱动对话代理在未来设计中如何更好地遵循锻炼原则并适应个人限制的优化方向。

The Impact of Artificial Intelligence on the Evolution of Digital Education: A Comparative Study of OpenAI Text Generation Tools including ChatGPT, Bing Chat, Bard, and Ernie

Abstract

arXiv:2309.02029v2 Announce Type: replace-cross Abstract: In the digital era, the integration of artificial intelligence (AI) in education has ushered in transformative changes, redefining teaching methodologies, curriculum planning, and student engagement. This review paper delves deep into the rapidly evolving landscape of digital education by contrasting the capabilities and impact of OpenAI's pioneering text generation tools like Bing Chat, Bard, Ernie with a keen focus on the novel ChatGPT. Grounded in a typology that views education through the lenses of system, process, and result, the paper navigates the multifaceted applications of AI. From decentralizing global education and personalizing curriculums to digitally documenting competence-based outcomes, AI stands at the forefront of educational modernization. Highlighting ChatGPT's meteoric rise to one million users in just five days, the study underscores its role in democratizing education, fostering autodidacticism, and magnifying student engagement. However, with such transformative power comes the potential for misuse, as text-generation tools can inadvertently challenge academic integrity. By juxtaposing the promise and pitfalls of AI in education, this paper advocates for a harmonized synergy between AI tools and the educational community, emphasizing the urgent need for ethical guidelines, pedagogical adaptations, and strategic collaborations.

摘要

在数字时代，人工智能（AI）与教育的融合引发了变革性转变，重新定义了教学方法、课程设计和学生参与模式。本文通过对比OpenAI旗下Bing Chat、Bard、文心一言等开创性文本生成工具——尤其聚焦新型ChatGPT——的能力与影响，深入探讨快速演变的数字化教育图景。基于将教育划分为系统、过程和结果的三维类型学框架，本研究系统梳理了AI的多维应用：从推动全球教育去中心化、实现课程个性化，到数字化记录能力本位教育成果，AI正引领教育现代化进程。研究特别指出ChatGPT仅用五天即突破百万用户量的现象，强调其在促进教育民主化、培养自主学习能力及提升学生参与度方面的作用。然而，这种变革性力量也可能被滥用，文本生成工具可能无意间冲击学术诚信。通过辩证分析AI教育的机遇与风险，本文主张建立AI工具与教育界的协同机制，并强调制定伦理准则、开展教学适应性改革以及建立战略合作关系的紧迫性。

MARFT: Multi-Agent Reinforcement Fine-Tuning

Abstract

arXiv:2504.16129v3 Announce Type: replace Abstract: LLM-based Multi-Agent Systems (LaMAS) have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to conducting sophisticated scientific research. Meanwhile, reinforcement learning (RL) has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of multi-agent reinforcement learning (MARL) methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a new POMDP variant called Flex-POMDP, which aligns with LaMAS optimization in real-world applications, and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to reinforcement fine-tuning (RFT), setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and open challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.

摘要

基于大语言模型的多智能体系统在解决复杂代理任务方面展现出卓越能力，涵盖从高质量演示文稿生成到复杂科学研究等广泛领域。尽管强化学习在提升智能体智能方面已被广泛认可，但利用基础强化学习技术对语言模型多智能体系统进行微调的研究仍显不足。此外，直接将多智能体强化学习方法应用于此类系统会因其固有特性和机制带来重大挑战。针对这些问题，本文系统研究了基于大语言模型的多智能体强化学习，并提出名为"多智能体强化微调"的新范式。我们创新性地设计了与真实世界优化需求相匹配的Flex-POMDP部分可观测马尔可夫决策过程，并构建了专门针对语言模型多智能体系统的通用算法框架，详细阐述了其理论基础、关键差异及实施策略。通过梳理从强化学习到强化微调的发展脉络，为多智能体领域的平行分析奠定基础。我们重点阐释了多智能体强化学习与多智能体强化微调在语言模型系统中的核心差异，这些差异推动着面向语言模型系统的强化微调理论构建。本研究核心贡献在于提出一个鲁棒且可扩展的多智能体强化微调框架，完整公开了算法实现以促进应用与深入研究。论文后续部分探讨了该范式在实际应用中的前景及面临的开放性挑战，通过理论基础与实践方法的有机结合，为研究者开发具有韧性与适应性的代理系统解决方案提供了路线图。框架实现代码已开源：https://github.com/jwliao-ai/MARFT。

Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models

Abstract

arXiv:2310.10378v5 Announce Type: replace-cross Abstract: Multilingual large-scale Pretrained Language Models (PLMs) have been shown to store considerable amounts of factual knowledge, but large variations are observed across languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we propose a Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently from accuracy. Using this metric, we conduct an in-depth analysis of the determining factors for CLC, both at model level and at language-pair level. Among other results, we find that increasing model size leads to higher factual probing accuracy in most languages, but does not improve cross-lingual consistency. Finally, we conduct a case study on CLC when new factual associations are inserted in the PLMs via model editing. Results on a small sample of facts inserted in English reveal a clear pattern whereby the new piece of knowledge transfers only to languages with which English has a high RankC score.

摘要

多语言大规模预训练语言模型（PLMs）已被证明存储了大量事实知识，但不同语言间存在显著差异。为确保不同语言背景的用户从同一模型获得一致的反馈，我们研究了多种多语言PLMs中事实知识的跨语言一致性（CLC）。为此，我们提出基于排序的一致性指标（RankC），以独立于准确性的方式评估跨语言知识一致性。利用该指标，我们从模型层面和语言对层面深入分析了CLC的决定因素。研究发现，增加模型规模能提升多数语言的事实探测准确性，但并未改善跨语言一致性。最后，我们通过模型编辑插入新事实关联的案例研究CLC。在英语中插入少量事实样本的结果表明，新知识仅会转移到与英语具有高RankC分数的语言中。
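
A ranking-based consistency score of this kind can be sketched as follows. The sketch assumes a formulation in which, for each query, the top-j candidate sets of two languages are compared for every cutoff j and the overlaps are averaged with softmax weights that favour the top ranks; the weighting scheme and function names are assumptions and may differ in detail from the paper's RankC definition.

```python
import math


def rank_consistency(ranking_a, ranking_b):
    """Weighted top-j overlap between two rankings of the same candidates.

    Each ranking lists language-independent candidate IDs, best first.
    The score is near 1.0 for identical rankings and drops as the top
    ranks diverge. Correctness of the top candidate plays no role.
    """
    n = len(ranking_a)
    if n == 0 or n != len(ranking_b):
        raise ValueError("rankings must be non-empty and of equal length")
    raw = [math.exp(n - j) for j in range(1, n + 1)]  # heavier weight on small j
    total = sum(raw)
    score = 0.0
    for j in range(1, n + 1):
        overlap = len(set(ranking_a[:j]) & set(ranking_b[:j]))
        score += (raw[j - 1] / total) * (overlap / j)
    return score


def rankc(ranking_pairs):
    """Average per-query consistency over (ranking_a, ranking_b) pairs."""
    pairs = list(ranking_pairs)
    return sum(rank_consistency(a, b) for a, b in pairs) / len(pairs)
```

Because the score compares rankings with each other and never with a gold answer, two languages can be highly consistent while both being wrong, which is what lets consistency be measured independently of accuracy.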

On the Challenges of Fuzzing Techniques via Large Language Models

Abstract

arXiv:2402.00350v3 Announce Type: replace-cross Abstract: In the modern era where software plays a pivotal role, software security and vulnerability analysis are essential for secure software development. Fuzz testing, an efficient and long-established software testing method, has been widely adopted across various domains. Meanwhile, the rapid development of Large Language Models (LLMs) has facilitated their application in the field of software testing, where they demonstrate remarkable performance. As existing fuzz testing techniques are not fully automated and software vulnerabilities continue to evolve, there is growing interest in leveraging large language models to generate fuzz tests. In this paper, we present a systematic overview of developments that use large language models for fuzz testing. To the best of our knowledge, this is the first work that covers the intersection of three areas: LLMs, fuzz testing, and fuzz tests generated by LLMs. We conduct a statistical analysis and discussion of the literature by summarizing the state-of-the-art methods up to the date of submission. Our work also investigates the potential for widespread deployment and application of LLM-generated fuzz testing techniques in the future, highlighting their promise for advancing automated software testing practices.

摘要

在软件占据核心地位的现代，软件安全与漏洞分析对保障软件开发安全至关重要。模糊测试作为一种高效且传统的软件测试方法，已在多个领域得到广泛应用。与此同时，大型语言模型（LLMs）的快速发展推动了其在软件测试领域的应用，并展现出卓越性能。鉴于现有模糊测试技术尚未实现完全自动化且软件漏洞持续演变，利用大语言模型生成模糊测试日益受到关注。本文系统综述了基于大语言模型的模糊测试研究进展。据我们所知，这是首个涵盖LLMs、模糊测试及基于LLMs生成的模糊测试三大交叉领域的研究工作。通过总结截至投稿时的最新方法，我们对相关文献进行了统计分析与讨论。本研究还探讨了LLMs生成的模糊测试技术未来大规模部署应用的潜力，揭示了其在推动自动化软件测试实践发展方面的广阔前景。

Hot PATE: Private Aggregation of Distributions for Diverse Task

Abstract

arXiv:2312.02132v3 Announce Type: replace-cross Abstract: The Private Aggregation of Teacher Ensembles (PATE) framework enables privacy-preserving machine learning by aggregating responses from disjoint subsets of sensitive data. Adaptations of PATE to tasks with inherent output diversity such as text generation face a core tension: preserving output diversity reduces teacher agreement, which in turn increases the noise required for differential privacy, degrading utility. Yet suppressing diversity is counterproductive, as modern large language models encapsulate knowledge in their output distributions. We propose Hot PATE, a variant tailored to settings where outputs are distributions. We formally define what it means to preserve diversity and introduce an efficient aggregation mechanism that transfers diversity to the randomized output without incurring additional privacy cost. Our method can be implemented with only API access to proprietary models and serves as a drop-in replacement for existing "cold" PATE aggregators. Empirically, Hot PATE achieves orders-of-magnitude improvement on in-context learning tasks.

摘要

教师集合私有聚合（PATE）框架通过聚合来自敏感数据不相交子集的响应，实现了隐私保护的机器学习。将PATE应用于具有固有输出多样性的任务（如文本生成）时面临核心矛盾：保持输出多样性会降低教师模型间的一致性，进而增加差分隐私所需的噪声量，损害模型效用；而压制多样性则适得其反，因为现代大语言模型的知识正封装于其输出分布中。我们提出热PATE（Hot PATE），该变体专为输出为分布的场景设计。我们正式定义了多样性保持的数学含义，并引入一种高效聚合机制，可在不增加隐私成本的前提下将多样性转移至随机化输出。该方法仅需通过API访问专有模型即可实现，可作为现有"冷"PATE聚合器的即插即用替代方案。实证表明，热PATE在上下文学习任务上实现了数量级的性能提升。

Physics of Language Models: Part 1, Learning Hierarchical Language Structures

Abstract

arXiv:2305.13673v4 Announce Type: replace-cross Abstract: Transformer-based language models are effective but complex, and understanding their inner workings and reasoning mechanisms is a significant challenge. Previous research has primarily explored how these models handle simple tasks like name copying or selection, and we extend this by investigating how these models perform recursive language structure reasoning defined by context-free grammars (CFGs). We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating lengthy sentences (e.g., hundreds of tokens) that are locally ambiguous and require dynamic programming to parse. Despite this complexity, we demonstrate that generative models like GPT can accurately learn and reason over CFG-defined hierarchies and generate sentences based on them. We explore the model's internals, revealing that its hidden states precisely capture the structure of CFGs, and its attention patterns resemble the information passing in a dynamic programming algorithm. This paper also presents several corollaries, including showing why absolute positional embeddings are inferior to relative and rotary embeddings; uniform attention alone is surprisingly effective (motivating our follow-up work on Canon layers); encoder-only models (e.g., BERT, DeBERTa) struggle with deep structure reasoning on CFGs compared to autoregressive models (e.g., GPT); and injecting structural or syntactic noise into pretraining data markedly improves robustness to corrupted language prompts.

摘要

基于Transformer的语言模型虽效果显著但结构复杂，理解其内部工作机制与推理原理是一项重大挑战。既往研究主要探讨模型处理名称复制或选择等简单任务的能力，我们则进一步探究模型如何执行由上下文无关文法（CFG）定义的递归语言结构推理。本文提出一组能产生层级规则的合成CFG，其生成的冗长句子（如数百个标记）具有局部歧义性，需通过动态规划进行解析。尽管存在这种复杂性，我们证明GPT等生成模型能准确学习CFG定义的层级结构并进行推理，进而生成符合文法的句子。通过剖析模型内部机制，发现其隐藏状态能精确捕捉CFG结构，注意力模式则类似于动态规划算法中的信息传递过程。本文还得出若干推论：绝对位置编码效果逊于相对位置编码与旋转式编码；单一均匀注意力机制效果出人意料地好（这促使我们后续开展Canon层研究）；相比自回归模型（如GPT），仅编码器模型（如BERT、DeBERTa）在CFG深层结构推理上表现欠佳；在预训练数据中注入结构或句法噪声能显著提升模型对受损语言提示的鲁棒性。
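
To make "require dynamic programming to parse" concrete, here is a minimal CYK recogniser, the textbook dynamic-programming algorithm for CFGs in Chomsky normal form. The toy grammar (strings of n "a" tokens followed by n "b" tokens) is hypothetical and far smaller than the synthetic CFGs in the paper, which are not necessarily in this normal form.

```python
def cyk_accepts(tokens, lexical, binary, start="S"):
    """Return True if `tokens` can be derived from `start`.

    lexical: dict terminal -> set of nonterminals A with rule A -> terminal
    binary:  dict (B, C)   -> set of nonterminals A with rule A -> B C
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds every nonterminal that derives tokens[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][i] = set(lexical.get(tok, ()))
    for span in range(2, n + 1):            # length of the substring
        for i in range(n - span + 1):       # start index
            j = i + span - 1                # end index
            for k in range(i, j):           # split point between i..k and k+1..j
                for left in table[i][k]:
                    for right in table[k + 1][j]:
                        table[i][j] |= binary.get((left, right), set())
    return start in table[0][n - 1]


# Toy grammar: S -> A B | A C,  C -> S B,  A -> 'a',  B -> 'b'
LEXICAL = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "C"): {"S"}, ("S", "B"): {"C"}}

print(cyk_accepts(list("aabb"), LEXICAL, BINARY))
print(cyk_accepts(list("aab"), LEXICAL, BINARY))
```

Each table cell combines information from shorter spans on either side of a split point, which is the kind of span-to-span information passing the abstract compares attention patterns to.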

BAT: Learning to Reason about Spatial Sounds with Large Language Models

Abstract

arXiv:2402.01591v3 Announce Type: replace-cross Abstract: Spatial sound reasoning is a fundamental human skill, enabling us to navigate and interpret our surroundings based on sound. In this paper we present BAT, which combines the spatial sound perception ability of a binaural acoustic scene analysis model with the natural language reasoning capabilities of a large language model (LLM) to replicate this innate ability. To address the lack of existing datasets of in-the-wild spatial sounds, we synthesized a binaural audio dataset using AudioSet and SoundSpaces 2.0. Next, we developed SpatialSoundQA, a spatial sound-based question-answering dataset, offering a range of QA tasks that train BAT in various aspects of spatial sound perception and reasoning. The acoustic front-end encoder of BAT is a novel spatial audio encoder named Spatial Audio Spectrogram Transformer, or Spatial-AST, which by itself achieves strong performance across sound event detection, spatial localization, and distance estimation. By integrating Spatial-AST with the LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment. Our experiments demonstrate BAT's superior performance on both spatial sound perception and reasoning, showcasing the immense potential of LLMs in navigating and interpreting complex spatial audio environments.

摘要

空间声音推理是人类的一项基本技能，使我们能够基于声音导航和解读周围环境。本文提出的BAT模型，通过将双耳声学场景分析模型的空间听觉感知能力与大型语言模型（LLM）的自然语言推理能力相结合，复现了这种先天能力。针对真实场景空间声音数据集的匮乏问题，我们利用AudioSet和SoundSpaces 2.0合成了双耳音频数据集。随后开发了基于空间声音的问答数据集SpatialSoundQA，该数据集提供多种问答任务以训练BAT在空间声音感知与推理的多维能力。BAT的声学前端编码器是新型空间音频编码器Spatial-AST（空间音频频谱变换器），其单独在声音事件检测、空间定位和距离估计任务中均表现出色。通过将Spatial-AST与LLaMA-2 7B模型集成，BAT超越了传统声音事件定位与检测（SELD）任务，使模型能够推理环境声音间的关联关系。实验结果表明，BAT在空间声音感知和推理方面均具有卓越性能，展现了LLM在复杂空间音频环境导航与解析中的巨大潜力。

Can We Verify Step by Step for Incorrect Answer Detection?

Abstract

arXiv:2402.10528v4 Announce Type: replace-cross Abstract: Chain-of-Thought (CoT) prompting has marked a significant advancement in enhancing the reasoning capabilities of large language models (LLMs). Previous studies have developed various extensions of CoT, which focus primarily on enhancing end-task performance. In addition, there has been research on assessing the quality of reasoning chains in CoT. This raises an intriguing question: Is it possible to predict the accuracy of LLM outputs by scrutinizing the reasoning chains they generate? To answer this research question, we introduce a benchmark, R2PE, designed specifically to explore the relationship between reasoning chains and performance in various reasoning tasks spanning five different domains. This benchmark aims to measure the falsehood of the final output of LLMs based on the reasoning steps. To make full use of information in multiple reasoning chains, we propose the process discernibility score (PDS) framework that beats the answer-checking baseline by a large margin. Concretely, this resulted in an average of 5.1% increase in the F1 score and 2.97% improvement in AUC-PR across all 45 subsets within R2PE. We further demonstrate our PDS's efficacy in advancing open-domain QA accuracy.

摘要

思维链（CoT）提示技术的出现标志着大型语言模型（LLMs）推理能力提升的重大进展。先前研究已开发出多种CoT扩展方法，主要集中于提升终端任务性能。此外，亦有研究关注如何评估CoT中推理链的质量。这引发了一个值得探究的问题：能否通过分析模型生成的推理链来预测其输出的准确性？为回答该研究问题，我们引入了一个专门设计的基准测试R2PE，用于探究五个不同领域推理任务中推理链与性能表现的关系。该基准旨在基于推理步骤衡量LLMs最终输出的错误率。为充分利用多重推理链中的信息，我们提出了过程可辨性评分（PDS）框架，其性能显著超越答案核对基线方法。具体而言，在R2PE全部45个子集中，该框架平均使F1分数提升5.1%，AUC-PR提高2.97%。我们进一步验证了PDS在提升开放域问答准确性方面的有效性。

Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance

Abstract

arXiv:2402.12819v3 Announce Type: replace-cross Abstract: When solving NLP tasks with limited labelled data, researchers typically either use a general large language model without further update, or use a small number of labelled samples to tune a specialised smaller model. In this work, we answer an important question -- how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration. By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 8 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only a few samples (on average 100) to be on par with or better than the general ones. At the same time, the number of required labels strongly depends on the dataset or task characteristics, with fine-tuning on binary datasets requiring significantly more samples. When performance variance is taken into consideration, the number of required labels increases on average by 100-200%. Finally, larger models do not consistently lead to better performance and lower variance, with 4-bit quantisation having negligible impact.

摘要

在标注数据有限的自然语言处理任务中，研究者通常采用两种策略：直接使用未经更新的通用大语言模型，或利用少量标注样本微调专用小模型。本研究通过考察8个语言模型在微调、指令微调、提示学习和上下文学习中的表现，针对8个具有不同特征的典型文本分类任务，量化分析了专用小模型超越通用大模型所需的最小标注样本量（同时考虑性能波动因素）。实验表明：专用模型平均仅需约100个样本即可达到或超越通用模型性能；但所需样本量高度依赖数据集特性，其中二分类任务的微调需要显著更多样本。当考虑性能方差时，所需标注量平均增加100%-200%。此外，大模型并非始终表现更优且方差更小，而4位量化对性能影响可忽略。

ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Abstract

arXiv:2406.02613v2 Announce Type: replace-cross Abstract: Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.

摘要

训练大型语言模型(LLM)依赖于基于多GPU的分布式实现，通过分片优化器并行计算梯度。然而，数据并行设置中的梯度同步会带来随工作节点数量增长的通信开销，从而限制并行化效率。本地优化算法虽能减少通信，但由于无法分片优化器状态而导致内存成本高昂，影响可扩展性。为此，我们提出ACCO（ACcumulate while COmmunicate，边通信边累积）算法，这是一种面向分布式LLM训练的内存高效优化方法。该算法通过在新梯度计算的同时同步延迟梯度，有效减少GPU空闲时间并支持异构硬件。为缓解延迟更新导致的收敛问题，我们引入了一种创新技术，确保训练动态与标准分布式优化保持一致。与ZeRO-1相比，本方法速度显著提升，且在异构硬件上展现出优异的扩展性。

CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Abstract

arXiv:2405.02355v4 Announce Type: replace-cross Abstract: Using large language models to generate code has shown promise for transforming software development. Despite the intelligence shown by large language models, their specificity in code generation can still be improved because of the syntactic gap and mismatched vocabulary between natural language and programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework that bridges the gap between NL and PL to enhance the performance of LLMs. CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to better interpret programming domain knowledge, which can help natural-language-based LLMs better understand code syntax and serve as a bridge among different programming languages. To bring the extracted structural knowledge into the foundation models, we propose 1) a hard meta-graph prompt template that transforms the challenging syntax graph into an informative graphical view for tuning-free models and 2) a soft prompting technique that injects the domain knowledge of programming languages into model parameters by finetuning the models with the soft signals encoded by a GNN expert model. Specifically, two constraints are designed to improve the alignment and structure expressiveness, contributing to the informativeness of the single-token-sized external <GraphEmb> for enhanced code generation. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation. Implementation is available at https://anonymous.4open.science/r/Code-5970/ .

摘要

利用大型语言模型生成代码在推动软件开发变革方面展现出重要意义。尽管大语言模型已表现出智能特性，但由于自然语言与编程语言之间存在语法鸿沟和词汇不匹配问题，其在代码生成方面的特异性仍有提升空间。本文提出CodeGRAG——一种基于图检索增强的代码生成框架，通过弥合自然语言与编程语言之间的隔阂来提升大语言模型性能。该框架基于代码块的控制流与数据流构建图形化视图，以更好地解析编程领域知识，既能帮助基于自然语言的大语言模型更深入理解代码语法，也可作为不同编程语言间的转换桥梁。为将提取的结构化知识注入基础模型，我们提出：1）硬元图提示模板，将具有挑战性的语法图转化为信息丰富的图形化视图，适用于免调优模型；2）软提示技术，通过图神经网络专家模型编码的软信号对模型进行微调，将编程语言领域知识注入模型参数。特别设计了两项约束条件以提升对齐度和结构表现力，从而增强单令牌大小外部<GraphEmb>的信息量以优化代码生成。CodeGRAG显著提升了大语言模型的代码生成能力，甚至能在跨语言代码生成任务中带来性能增益。实现代码详见https://anonymous.4open.science/r/Code-5970/。

OR-Bench: An Over-Refusal Benchmark for Large Language Models

Abstract

arXiv:2405.20947v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that can elicit the over-refusal behaviors of LLMs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 32 popular LLMs across 8 model families. Our datasets are publicly available at https://huggingface.co/bench-llms and our codebase is open-sourced at https://github.com/justincui03/or-bench. We hope this benchmark can help the community develop better safety aligned models.

摘要

大型语言模型（LLMs）需经过严格的安全对齐以避免恶意输出。尽管大量研究致力于减少有害内容生成，但安全性的提升常伴随过度拒绝的副作用——模型可能拒绝无害提示并降低实用性。虽然过度拒绝现象已被实证观察，但由于难以构建能诱发该行为的提示，系统性测量仍具挑战性。本研究提出一种自动生成大规模过度拒绝数据集的新方法，并据此推出首个大型基准OR-Bench。该基准包含10类常见拒绝场景下的80,000条过度拒绝提示，约1,000条对前沿模型仍具挑战性的困难提示，以及600条毒性提示以防止盲目响应。我们进而对8大模型家族的32个流行LLMs进行了全面测量。数据集公开于https://huggingface.co/bench-llms，代码库开源在https://github.com/justincui03/or-bench。期望该基准能助力社区开发更优的安全对齐模型。

Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models

Abstract

arXiv:2406.17513v3 Announce Type: replace-cross Abstract: Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent mental states of self and others. Understanding these internal mechanisms is critical - not only to move beyond surface-level performance, but also for model alignment and safety, where subtle misattributions of mental states may go undetected in generated outputs. In this work, we present the first systematic investigation of belief representations in LMs by probing models across different scales, training regimens, and prompts - using control tasks to rule out confounds. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs' internal representations of others' beliefs, which are structured - not mere by-products of spurious correlations - yet brittle to prompt variations. Crucially, we show that these representations can be strengthened: targeted edits to model activations can correct wrong ToM inferences.

摘要

尽管对心智理论（ToM）任务评估语言模型（LMs）的兴趣日益增长，但关于LMs如何内部表征自我与他人心理状态的机制仍知之甚少。理解这些内部机制至关重要——不仅是为了超越表层性能表现，更关乎模型对齐与安全性，因为生成输出中细微的心理状态误判可能难以察觉。本研究首次通过探测不同规模、训练方案和提示下的模型（辅以控制任务排除干扰），系统探究了LMs中的信念表征。实验证据表明：模型规模和微调均能显著改善LMs对他人信念的内部表征，这些表征具有结构性（而非虚假相关性的副产品），但对提示变化表现脆弱。关键发现是这些表征可被强化：通过对模型激活值进行定向编辑，能够修正错误的ToM推理。

Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging

Abstract

arXiv:2406.16330v2 Announce Type: replace-cross Abstract: While large language models (LLMs) excel in many domains, their complexity and scale challenge deployment in resource-limited environments. Current compression techniques, such as parameter pruning, often fail to effectively utilize the knowledge from pruned parameters. To address these challenges, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a novel approach that uses manifold learning and the Normalized Pairwise Information Bottleneck (NPIB) measure to merge similar layers, reducing model size while preserving essential performance. We evaluate MKA on multiple benchmark datasets and various LLMs. Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods. Moreover, when coupled with quantization, MKA delivers even greater compression. Specifically, on the MMLU dataset using the Llama3-8B model, MKA achieves a compression ratio of 43.75% with a minimal performance decrease of only 2.82%. The proposed MKA method offers a resource-efficient and performance-preserving model compression technique for LLMs.

摘要

虽然大语言模型（LLMs）在众多领域表现卓越，但其复杂性和规模对资源受限环境中的部署提出了挑战。现有压缩技术（如参数剪枝）往往难以有效利用被剪枝参数中的知识。针对这些问题，我们提出基于流形学习的知识对齐与层融合压缩方法（MKA），该创新方案通过流形学习和归一化成对信息瓶颈（NPIB）度量实现相似层合并，在保持核心性能的同时减小模型规模。我们在多个基准数据集和不同LLMs上评估MKA，结果表明该方法不仅能维持模型性能，还可实现显著压缩比，优于传统剪枝方法。当与量化技术结合时，MKA能达成更高压缩率。具体而言，在Llama3-8B模型MMLU数据集上的实验显示，MKA获得43.75%的压缩率时性能仅下降2.82%。所提出的MKA方法为LLMs提供了一种资源高效且性能保持的模型压缩技术。

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

Abstract

arXiv:2407.01976v3 Announce Type: replace-cross Abstract: Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long sequence issues while leveraging autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in KIE and VQA. Comprehensive benchmark evaluations reveal significant improvements of LayTextLLM, with a 15.2% increase on KIE tasks and 10.7% on VQA tasks compared to previous SOTA OCR-based LLMs. All resources are available at https://github.com/LayTextLLM/LayTextLLM.

摘要

近年来，多项研究表明，仅将OCR提取的文本与空间布局信息结合大型语言模型（LLMs）即可在文档理解任务中取得显著效果。然而，现有整合空间布局与文本的方法存在局限性，例如生成过长的文本序列或未能充分利用LLMs的自回归特性。本研究提出了一种用于文档理解的新型模型——布局与文本交错的大型语言模型（LayTextLLM）。该模型将每个边界框映射为单一嵌入向量并与文本交错排列，既有效避免了长序列问题，又充分利用了LLMs的自回归特性。LayTextLLM不仅优化了布局与文本数据的交互机制，还在关键信息抽取（KIE）和视觉问答（VQA）任务中表现出性能提升。基准测试表明，相较于现有最先进的基于OCR的LLMs，LayTextLLM在KIE任务上实现了15.2%的性能提升，在VQA任务上提升了10.7%。所有资源均已开源，详见https://github.com/LayTextLLM/LayTextLLM。
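
The interleaving idea in the LayTextLLM abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not say how the projection is built, so a plain linear map stands in for it, the layout slot is placed before its text span, and `project_box`, `interleave`, the toy weight matrix, and the toy text embedding are hypothetical names and values, not the authors' implementation.

```python
def project_box(box, weight):
    """Map a 4-number bounding box to ONE d-dimensional embedding.

    Assumption: a plain linear projection; `weight` is a d x 4 matrix.
    """
    return [sum(w * x for w, x in zip(row, box)) for row in weight]


def interleave(segments, weight, embed_text):
    """Build an input sequence with one layout slot per text segment.

    `segments` is a list of (box, tokens) pairs. Each box contributes a
    single embedding, followed by the embeddings of its text tokens.
    """
    sequence = []
    for box, tokens in segments:
        sequence.append(project_box(box, weight))       # one slot for layout
        sequence.extend(embed_text(t) for t in tokens)  # then the text itself
    return sequence


# Toy setup: embedding size d = 3 and a hypothetical text embedding.
toy_weight = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
toy_embed = lambda token: [float(len(token)), 0.0, 0.0]

segments = [
    ((0.0, 0.25, 0.5, 0.75), ["Invoice", "No."]),
    ((0.5, 0.25, 1.0, 0.75), ["12345"]),
]
sequence = interleave(segments, toy_weight, toy_embed)
print(len(sequence))  # 2 layout slots + 3 text tokens
```

Each box costs exactly one position in the sequence, whereas writing the four coordinates out as text would cost several tokens per box. That is the sequence-length saving the abstract refers to.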

ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation

Abstract

arXiv:2406.10785v2 Announce Type: replace-cross Abstract: In this paper, we introduce Shared Low Rank Adaptation (ShareLoRA), a Large Language Model (LLM) fine-tuning technique that balances parameter efficiency, adaptability, and robustness without compromising performance. By strategically sharing the low-rank weight matrices across different layers, ShareLoRA achieves 44% to 96% reduction in trainable parameters compared to standard LoRA, alongside a substantial decrease in memory overhead. This efficiency gain scales with model size, making ShareLoRA particularly advantageous for resource-constrained environments. Importantly, ShareLoRA not only maintains model performance but also exhibits robustness in both classification and generation tasks across diverse models, including RoBERTa, GPT-2, and LLaMA series (1, 2, and 3). It consistently outperforms LoRA in zero-shot, few-shot, and continual fine-tuning scenarios, achieving up to 1.2% average accuracy improvement, and enhanced generalization across domains. In continual learning settings, ShareLoRA achieves 1.2% higher accuracy on GSM8K, 0.6% on HumanEval, and 0.5% on both MMLU and MMLU-Pro. Our results demonstrate that ShareLoRA supports high-quality fine-tuning while offering strong generalization and continual adaptation across various model scales and diverse tasks.

摘要

本文提出了一种名为共享低秩适配(ShareLoRA)的大型语言模型微调技术，该技术在保持性能的同时实现了参数效率、适应性和鲁棒性的平衡。通过在不同层间策略性共享低秩权重矩阵，ShareLoRA相较于标准LoRA可减少44%至96%的可训练参数，并显著降低内存开销。这种效率增益随模型规模扩大而提升，使得ShareLoRA在资源受限环境中具有显著优势。值得注意的是，ShareLoRA不仅保持了模型性能，还在RoBERTa、GPT-2及LLaMA系列(1/2/3)等多样化模型的分类与生成任务中展现出鲁棒性。在零样本、少样本和持续微调场景下，其表现始终优于LoRA，平均准确率最高提升1.2%，并展现出跨领域泛化能力的增强。在持续学习设置中，ShareLoRA在GSM8K上准确率提高1.2%，HumanEval提升0.6%，MMLU和MMLU-Pro均提升0.5%。实验结果表明，ShareLoRA能在不同模型规模和多样任务中支持高质量微调，同时具备强大的泛化能力和持续适应性。
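
To see why sharing low-rank matrices across layers shrinks the trainable parameter count, a back-of-the-envelope sketch may help. It assumes the usual LoRA factorization, in which each adapted weight of shape `d_out x d_in` gets a down-projection `A` (`rank x d_in`) and an up-projection `B` (`d_out x rank`). Which matrices ShareLoRA actually shares is not stated in the abstract, so the `share_a` and `share_b` switches and the example dimensions below are hypothetical.

```python
def lora_params(layers, d_in, d_out, rank):
    """Trainable parameters for standard LoRA: separate A and B per layer."""
    return layers * rank * (d_in + d_out)


def shared_lora_params(layers, d_in, d_out, rank, share_a=True, share_b=False):
    """Trainable parameters when A and/or B is shared by all layers.

    Assumption: a shared matrix is stored once and reused in every layer.
    """
    a_copies = 1 if share_a else layers
    b_copies = 1 if share_b else layers
    return a_copies * rank * d_in + b_copies * rank * d_out


# Hypothetical configuration: 32 layers, 4096 x 4096 weights, rank 8.
base = lora_params(32, 4096, 4096, 8)
only_a = shared_lora_params(32, 4096, 4096, 8, share_a=True, share_b=False)
both = shared_lora_params(32, 4096, 4096, 8, share_a=True, share_b=True)

print(f"share A only: {1 - only_a / base:.1%} fewer trainable parameters")
print(f"share A and B: {1 - both / base:.1%} fewer trainable parameters")
```

Under these assumed dimensions, sharing one matrix removes a little under half of the adapter parameters and sharing both removes nearly all of them. The sketch shows only the counting argument and makes no claim to reproduce the paper's reported figures.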

LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction

Abstract

arXiv:2408.12249v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extraction. To bridge this gap, in this paper, we systematically benchmark LLM performance in Medical Classification and Named Entity Recognition (NER) tasks. We aim to disentangle the contribution of different factors to the performance, particularly the impact of LLMs' task knowledge and reasoning capabilities, their (parametric) domain knowledge, and addition of external knowledge. To this end, we evaluate various open LLMs - including BioMistral and Llama-2 models - on a diverse set of biomedical datasets, using standard prompting, Chain-of-Thought (CoT) and Self-Consistency based reasoning as well as Retrieval-Augmented Generation (RAG) with PubMed and Wikipedia corpora. Counterintuitively, our results reveal that standard prompting consistently outperforms more complex techniques across both tasks, laying bare the limitations in the current application of CoT, self-consistency and RAG in the biomedical domain. Our findings suggest that advanced prompting methods developed for knowledge- or reasoning-intensive tasks, such as CoT or RAG, are not easily portable to biomedical tasks where precise structured outputs are required. This highlights the need for more effective integration of external knowledge and reasoning mechanisms in LLMs to enhance their performance in real-world biomedical applications.

摘要

大型语言模型（LLMs）在医疗健康领域的应用日益广泛，在问答和文档摘要等任务上已达到领域专家的水平。尽管在这些任务上取得了成功，但LLMs在生物医学领域传统任务（如结构化信息抽取）中的表现尚不明确。为填补这一空白，本文系统评估了LLMs在医学分类和命名实体识别（NER）任务中的性能，旨在厘清不同因素对模型表现的影响，特别是LLMs的任务知识与推理能力、（参数化）领域知识以及外部知识引入的贡献。为此，我们在多种生物医学数据集上评估了包括BioMistral和Llama-2在内的开源LLMs，采用标准提示、思维链（CoT）与自洽推理以及基于PubMed和维基百科语料的检索增强生成（RAG）等方法。与直觉相反，实验结果表明标准提示法在两项任务中持续优于复杂技术，暴露出当前CoT、自洽性和RAG在生物医学领域应用的局限性。研究发现，针对知识密集或推理密集型任务开发的先进提示方法（如CoT或RAG）难以直接迁移至需要精确结构化输出的生物医学任务。这凸显了在LLMs中更有效整合外部知识与推理机制的必要性，以提升其在真实世界生物医学应用中的性能。

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Abstract

arXiv:2407.11062v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are crucial in modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it is impractical due to substantial training resources. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: Block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). To the best of our knowledge, Block-AP is the first method to enable direct training of all parameters in a block-wise manner, reducing accuracy loss in low-bit scenarios by enhancing the solution space during optimization. E2E-QP then trains only the quantization parameters (step sizes) end-to-end, further improving the performance of quantized models by considering interactions among all sub-modules. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3 points accuracy degradation compared to the full precision (69.48 vs. 72.41). Code is available at https://github.com/OpenGVLab/EfficientQAT.

摘要

大型语言模型（LLMs）在现代自然语言处理和人工智能领域至关重要，但其面临管理巨大内存需求的挑战。尽管量化感知训练（QAT）通过低比特表示降低内存消耗且精度损失最小，但由于需要大量训练资源而难以实际应用。为此，我们提出高效量化感知训练（EfficientQAT），一种更具可行性的QAT算法。EfficientQAT包含两个连续阶段：全参数分块训练（Block-AP）和量化参数端到端训练（E2E-QP）。据我们所知，Block-AP是首个实现全参数分块直接训练的方法，通过优化过程中扩展解空间来减少低比特场景下的精度损失。随后E2E-QP仅端到端训练量化参数（步长），通过考虑所有子模块间的交互进一步提升量化模型性能。大量实验表明，EfficientQAT在包括基础LLM、指令调优LLM和多模态LLM等多种模型（参数量从7B到70B，不同量化比特数）上均优于先前量化方法。例如，EfficientQAT在单块A100-80GB GPU上41小时内获得2比特Llama-2-70B模型，相比全精度模型精度下降不足3个点（69.48 vs. 72.41）。代码详见https://github.com/OpenGVLab/EfficientQAT。
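
The "quantization parameters (step sizes)" mentioned in the EfficientQAT abstract can be made concrete with a generic uniform quantize-dequantize round trip. This is a minimal sketch of standard uniform quantization and not the EfficientQAT training procedure. The function name `fake_quantize`, the zero-point convention, and the example values are assumptions chosen for illustration.

```python
def fake_quantize(weights, step, zero_point, bits):
    """Quantize to `bits`-bit integers and map back to floats.

    q     = clamp(round(w / step) + zero_point, 0, 2**bits - 1)
    w_hat = (q - zero_point) * step
    """
    qmax = 2 ** bits - 1
    restored = []
    for w in weights:
        # Note: Python's round() rounds exact halves to the nearest even integer.
        q = round(w / step) + zero_point
        q = max(0, min(qmax, q))  # clamp into the representable integer range
        restored.append((q - zero_point) * step)
    return restored


# 2-bit example: only four integer levels (0..3) are available.
weights = [-1.0, -0.2, 0.3, 1.2]
restored = fake_quantize(weights, step=0.5, zero_point=2, bits=2)
print(restored)
```

With only four levels, the choice of `step` decides both the rounding error for small weights and the clipping error for large ones (here 1.2 is clipped to the top level). This trade-off is one reason a method might treat the step size as a trainable quantity.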

What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

Abstract

arXiv:2409.01893v2 Announce Type: replace-cross Abstract: Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. To achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of models through synthetic data. Existing methods typically use the Self-Instruct framework to generate instruction tuning data that improves long context capability. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: https://github.com/WowCZ/LongMIT.

摘要

近年来，具有扩展上下文窗口的大型语言模型（LLMs）在信息抽取、问答系统和复杂规划场景等任务中取得显著进展。为提升模型在长上下文任务中的表现，大量研究通过合成数据来增强模型的长期上下文处理能力。现有方法通常采用自指令框架生成指令调优数据以优化长上下文能力，但初步实验表明，生成样本中仅有不足35%具备多跳推理特性，且超过40%的样本质量欠佳，这限制了对长上下文理解的全面性及相关研究的深入。为提升合成数据质量，本文提出多智能体交互式多跳生成框架（MIMG），整合质量验证智能体、单跳问题生成智能体、多重问题采样策略及多跳问题合并智能体。该框架将高质量、多跳且多样化的数据比例提升至85%以上。此外，我们通过跨模型实验系统探究了文档选择策略、问题合并方法及验证技术。实验结果表明，基于本框架合成的高质量长上下文指令数据能显著提升模型性能，其效果甚至优于基于更大人工标注数据训练的模型。代码已开源：https://github.com/WowCZ/LongMIT。

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Abstract

arXiv:2410.02707v4 Announce Type: replace-cross Abstract: Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.

摘要

大型语言模型（LLMs）常会产生包括事实错误、偏见和推理失败在内的各类错误，这些统称为"幻觉"。近期研究表明，LLMs的内部状态编码了关于其输出真实性的信息，这些信息可用于错误检测。本研究发现，LLMs的内部表征所蕴含的真实性信息远比既往认知更为丰富。我们首先发现真实性信息集中于特定标记，利用这一特性可显著提升错误检测性能。然而，此类错误检测器无法跨数据集泛化，这表明——与先前论断相反——真实性编码并非普适而是多面的。进一步研究表明，内部表征还可用于预测模型可能犯的错误类型，从而为制定针对性缓解策略提供依据。最后，我们揭示了LLMs内部编码与外部行为间的矛盾：模型可能内部编码正确答案却持续生成错误输出。这些发现从模型内部视角深化了我们对LLM错误的理解，可为未来增强错误分析与缓解的研究提供指导。

ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks

Abstract

arXiv:2407.18525v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility in non-generative clinical prediction, often presumed inferior to specialized models, remains under-evaluated, leading to ongoing debate within the field and potential for misuse, misunderstanding, or over-reliance due to a lack of systematic benchmarking. Our ClinicRealm study addresses this by benchmarking 9 GPT-based LLMs, 5 BERT-based models, and 7 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHR). Key findings reveal a significant shift: for clinical note predictions, leading LLMs (e.g., DeepSeek R1/V3, GPT o3-mini-high) in zero-shot settings now decisively outperform finetuned BERT models. On structured EHRs, while specialized models excel with ample data, advanced LLMs (e.g., GPT-4o, DeepSeek R1/V3) show potent zero-shot capabilities, often surpassing conventional models in data-scarce settings. Notably, leading open-source LLMs can match or exceed proprietary counterparts. These results establish modern LLMs as powerful non-generative clinical prediction tools, particularly with unstructured text and offering data-efficient structured data options, thus necessitating a re-evaluation of model selection strategies. This research should serve as an important insight for medical informaticists, AI developers, and clinical researchers, potentially prompting a reassessment of current assumptions and inspiring new approaches to LLM application in predictive healthcare.

摘要

大型语言模型（LLMs）在医学领域的应用日益广泛。然而，其在非生成性临床预测中的效用——通常被认为逊色于专业模型——仍缺乏系统评估，导致该领域持续争论，并可能因缺乏基准测试而引发误用、误解或过度依赖。我们的ClinicRealm研究通过基准测试解决了这一问题：在非结构化临床记录和结构化电子健康档案（EHR）上评估了9个基于GPT的LLM、5个基于BERT的模型以及7种传统方法。关键发现揭示了重大转变：对于临床记录预测，领先的LLM（如DeepSeek R1/V3、GPT o3-mini-high）在零样本设置下现已显著超越微调BERT模型。在结构化EHR方面，虽然专业模型在数据充足时表现优异，但先进LLM（如GPT-4o、DeepSeek R1/V3）展现出强大的零样本能力，在数据稀缺场景中常超越传统模型。值得注意的是，领先的开源LLM可媲美甚至超越专有模型。这些结果表明现代LLM已成为强大的非生成性临床预测工具，尤其擅长非结构化文本处理，并为结构化数据提供数据高效的选择方案，从而需要重新评估模型选择策略。本研究将为医学信息学家、AI开发者和临床研究者提供重要洞见，可能促使重新审视当前假设，并启发预测性医疗中LLM应用的新方法。

Inference and Verbalization Functions During In-Context Learning

Abstract

arXiv:2410.09349v2 Announce Type: replace-cross Abstract: Large language models (LMs) are capable of in-context learning from a few demonstrations (example-label pairs) to solve new tasks during inference. Despite the intuitive importance of high-quality demonstrations, previous work has observed that, in some settings, ICL performance is minimally affected by irrelevant labels (Min et al., 2022). We hypothesize that LMs perform ICL with irrelevant labels via two sequential processes: an inference function that solves the task, followed by a verbalization function that maps the inferred answer to the label space. Importantly, we hypothesize that the inference function is invariant to remappings of the label space (e.g., "true"/"false" to "cat"/"dog"), enabling LMs to share the same inference function across settings with different label words. We empirically validate this hypothesis with controlled layer-wise interchange intervention experiments. Our findings confirm the hypotheses on multiple datasets and tasks (natural language inference, sentiment analysis, and topic classification) and further suggest that the two functions can be localized in specific layers across various open-sourced models, including GEMMA-7B, MISTRAL-7B-V0.3, GEMMA-2-27B, and LLAMA-3.1-70B.

Hacking, The Lazy Way: LLM Augmented Pentesting

Abstract

arXiv:2409.09493v2 Announce Type: replace-cross Abstract: In our research, we introduce a new concept called "LLM Augmented Pentesting", demonstrated with a tool named "Pentest Copilot", that revolutionizes the field of ethical hacking by integrating Large Language Models (LLMs) into penetration testing workflows, leveraging the advanced GPT-4-turbo model. Our approach focuses on overcoming the traditional resistance to automation in penetration testing by employing LLMs to automate specific sub-tasks while ensuring a comprehensive understanding of the overall testing process. Pentest Copilot showcases remarkable proficiency in tasks such as utilizing testing tools, interpreting outputs, and suggesting follow-up actions, efficiently bridging the gap between automated systems and human expertise. By integrating a "chain of thought" mechanism, Pentest Copilot optimizes token usage and enhances decision-making processes, leading to more accurate and context-aware outputs. Additionally, our implementation of Retrieval-Augmented Generation (RAG) minimizes hallucinations and ensures the tool remains aligned with the latest cybersecurity techniques and knowledge. We also highlight a unique infrastructure system that supports in-browser penetration testing, providing a robust platform for cybersecurity professionals. Our findings demonstrate that LLM Augmented Pentesting not only significantly enhances task completion rates in penetration testing but also effectively addresses real-world challenges, marking a substantial advancement in the cybersecurity domain.

摘要

在我们的研究中，我们提出了一种名为"LLM增强渗透测试"的新概念，并通过名为"Pentest Copilot"的工具进行演示。该技术通过将大型语言模型（LLMs）与渗透测试工作流相整合，利用先进的GPT-4-turbo模型，彻底革新了道德黑客领域。我们的方法重点在于通过运用LLMs自动化特定子任务，同时确保对整体测试流程的全面理解，从而克服传统渗透测试中对自动化的抵触。

Pentest Copilot在诸如使用测试工具、解释输出结果以及建议后续操作等任务中展现出卓越能力，有效弥合了自动化系统与人类专业知识之间的鸿沟。通过整合"思维链"机制，Pentest Copilot优化了令牌使用并增强了决策过程，从而产生更准确且具有上下文感知能力的输出。此外，我们采用的检索增强生成（RAG）技术最大限度地减少了幻觉现象，确保工具始终与最新的网络安全技术和知识保持同步。我们还重点介绍了一种支持浏览器内渗透测试的独特基础设施系统，为网络安全专业人员提供了强大平台。研究结果表明，LLM增强渗透测试不仅能显著提高渗透测试任务完成率，还能有效应对现实挑战，标志着网络安全领域的重大进步。

Enhancing LLM Evaluations: The Garbling Trick

Abstract

arXiv:2411.01533v3 Announce Type: replace-cross Abstract: As large language models (LLMs) become increasingly powerful, traditional evaluation metrics tend to saturate, making it challenging to distinguish between models. We propose a general method to transform existing LLM evaluations into a series of progressively more difficult tasks. These enhanced evaluations emphasize reasoning capabilities and can reveal relative performance differences that are not apparent in the original assessments. To demonstrate the effectiveness of our approach, we create a new multiple-choice test corpus, extend it into a family of evaluations, and assess a collection of LLMs. Our results offer insights into the comparative abilities of these models, particularly highlighting the differences between base LLMs and more recent "reasoning" models.

摘要

随着大语言模型（LLMs）性能的持续提升，传统评估指标趋于饱和，导致模型间的区分变得困难。我们提出一种通用方法，可将现有LLM评估转化为一系列难度递增的任务。这些增强型评估侧重于推理能力，能够揭示原始评估中无法显现的性能差异。为验证方法的有效性，我们构建了新的多选题测试语料库，并将其扩展为评估体系，对一系列LLM进行了测试。研究结果揭示了这些模型的相对能力差异，尤其凸显了基础LLM与新型'推理'模型之间的区别。

Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies

Abstract

arXiv:2410.03968v3 Announce Type: replace-cross Abstract: Decoding strategies play a pivotal role in text generation for modern language models, yet a puzzling gap divides theory and practice. Surprisingly, strategies that should intuitively be optimal, such as Maximum a Posteriori (MAP), often perform poorly in practice. Meanwhile, popular heuristic approaches like Top-k and Nucleus sampling, which employ truncation and normalization of the conditional next-token probabilities, have achieved great empirical success but lack theoretical justifications. In this paper, we propose Decoding Game, a comprehensive theoretical framework which reimagines text generation as a two-player zero-sum game between Strategist, who seeks to produce text credible in the true distribution, and Nature, who distorts the true distribution adversarially. After discussing the decomposability of multi-step generation, we derive the optimal strategy in closed form for one-step Decoding Game. It is shown that the adversarial Nature imposes an implicit regularization on likelihood maximization, and truncation-normalization methods are first-order approximations to the optimal strategy under this regularization. Additionally, by generalizing the objective and parameters of Decoding Game, near-optimal strategies encompass diverse methods such as greedy search, temperature scaling, and hybrids thereof. Numerical experiments are conducted to complement our theoretical analysis.

摘要

解码策略在现代语言模型的文本生成中起着关键作用，然而理论与实践之间存在着令人困惑的差距。令人惊讶的是，那些理论上应是最优的策略（如最大后验概率估计）在实践中往往表现不佳。与此同时，诸如Top- $k$ 采样和核采样等启发式方法通过截断并归一化条件下一词元概率分布，虽取得显著实证成功却缺乏理论依据。本文提出"解码博弈"理论框架，将文本生成重新构想为策略家与自然之间的二人零和博弈——策略家致力于生成符合真实分布的文本，而自然则对真实分布进行对抗性扭曲。在讨论多步生成的可分解性后，我们推导出单步解码博弈的闭式最优策略。研究表明：对抗性自然会对似然最大化施加隐式正则化，而截断-归一化方法正是该正则化条件下最优策略的一阶近似。此外，通过推广解码博弈的目标函数与参数，近优策略可涵盖贪心搜索、温度缩放及其混合方法等多种技术。数值实验进一步佐证了理论分析结果。
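
The truncation and normalization step that the abstract attributes to Top-$k$ and Nucleus sampling can be sketched as follows. This is a minimal illustration of the two standard heuristics only, not the paper's Decoding Game solution; the function names and the example distribution are made up for illustration.

```python
from typing import List


def renormalize(weights: List[float]) -> List[float]:
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def top_k_truncate(probs: List[float], k: int) -> List[float]:
    """Keep the k most probable tokens, zero out the rest, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])


def nucleus_truncate(probs: List[float], top_p: float) -> List[float]:
    """Keep the smallest set of top tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])


# Made-up next-token distribution over a four-token vocabulary.
next_token_probs = [0.5, 0.25, 0.125, 0.125]
top2 = top_k_truncate(next_token_probs, k=2)
nucleus = nucleus_truncate(next_token_probs, top_p=0.7)
```

With this distribution both heuristics keep the first two tokens, so `top2` and `nucleus` coincide; raising `top_p` to 0.8 would admit a third token.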

Bias Similarity Across Large Language Models

Abstract

arXiv:2410.12010v3 Announce Type: replace-cross Abstract: Bias in Large Language Models remains a critical concern as these systems are increasingly deployed in high-stakes applications. Yet most fairness evaluations rely on scalar metrics or single-model analysis, overlooking how biases align -- or diverge -- across model families, scales, and tuning strategies. In this work, we reframe bias similarity as a form of functional similarity and evaluate 24 LLMs from four major families on over one million structured prompts spanning four bias dimensions. Our findings uncover that fairness is not strongly determined by model size, architecture, instruction tuning, or openness. Instead, bias behaviors are highly context-dependent and structurally persistent, often resistant to current alignment techniques. Contrary to common assumptions, we find that open-source models frequently match or outperform proprietary models in both fairness and utility. These results call into question the default reliance on proprietary systems and highlight the need for behaviorally grounded, model-specific audits to better understand how bias manifests and endures across the LLM landscape.

摘要

大型语言模型中的偏见问题仍是关键隐患，随着这些系统日益被部署于高风险应用中。然而大多数公平性评估仅依赖标量指标或单一模型分析，忽视了不同模型家族、规模及调优策略间偏见的趋同或分化现象。本研究将偏见相似性重构为功能相似性的一种形式，通过在四大偏见维度上构建的百万级结构化提示，评估了来自四个主要家族的24个大型语言模型。研究发现：公平性并非由模型规模、架构、指令微调或开放性显著决定；偏见行为具有高度情境依赖性及结构持续性，常对现有对齐技术表现出抗性。与普遍假设相反，开源模型在公平性和实用性方面常达到或超越专有模型。这些发现质疑了对专有系统的默认依赖，强调需要基于行为、针对特定模型进行审计，以更好地理解偏见在大型语言模型生态中的表现与存续机制。

MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses

Abstract

arXiv:2410.07076v5 Announce Type: replace-cross Abstract: Scientific discovery plays a pivotal role in advancing human society, and recent progress in large language models (LLMs) suggests their potential to accelerate this process. However, it remains unclear whether LLMs can autonomously generate novel and valid hypotheses in chemistry. In this work, we investigate whether LLMs can discover high-quality chemistry hypotheses given only a research background (comprising a question and/or a survey), without restriction on the domain of the question. We begin with the observation that hypothesis discovery is a seemingly intractable task. To address this, we propose a formal mathematical decomposition grounded in a fundamental assumption: that most chemistry hypotheses can be composed from a research background and a set of inspirations. This decomposition leads to three practical subtasks (retrieving inspirations, composing hypotheses with inspirations, and ranking hypotheses), which together constitute a sufficient set of subtasks for the overall scientific discovery task. We further develop an agentic LLM framework, MOOSE-Chem, that is a direct implementation of this mathematical decomposition. To evaluate this framework, we construct a benchmark of 51 high-impact chemistry papers published and online after January 2024, each manually annotated by PhD chemists with background, inspirations, and hypothesis. The framework is able to rediscover many hypotheses with high similarity to the ground truth, successfully capturing the core innovations, while ensuring no data contamination, since it uses an LLM with a knowledge cutoff date prior to 2024. Finally, based on the LLM's surprisingly high accuracy on inspiration retrieval, an inherently out-of-distribution task, we propose a bold assumption: that LLMs may already encode latent scientific knowledge associations not yet recognized by humans.

摘要

科学发现对推动人类社会进步具有关键作用，而大语言模型（LLMs）的最新进展表明其可能加速这一进程。然而，LLMs能否在化学领域自主生成新颖且有效的假说仍不明确。本研究探讨LLMs在仅给定研究背景（包含问题及/或综述）且不限制问题领域的情况下，能否发现高质量的化学假说。我们首先观察到假说发现是一个看似棘手的任务。为此，我们提出基于基本假设的形式化数学分解：大多数化学假说可由研究背景与灵感集合组合而成。该分解产生三个实践子任务——灵感检索、灵感组合生成假说、假说排序——这些子任务共同构成科学发现任务的充分条件集。我们进一步开发了代理型LLM框架MOOSE-Chem，直接实现了该数学分解。为评估框架性能，我们构建了包含51篇2024年1月后发表的高影响力化学论文的基准集，每篇均由化学博士人工标注背景、灵感与假说。该框架能重新发现与真实假说高度相似的诸多假说，成功捕捉核心创新点——同时确保无数据污染，因其使用的LLM知识截止日期早于2024年。最后，基于LLM在具有本质分布外特性的灵感检索任务中惊人的高准确率，我们提出大胆假设：LLMs可能已编码人类尚未认知的潜在科学知识关联。

ImageRAG: Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG

Abstract

arXiv:2411.07688v3 Announce Type: replace-cross Abstract: Ultra High Resolution (UHR) remote sensing imagery (RSI) (e.g. 100,000 $\times$ 100,000 pixels or more) poses a significant challenge for current Remote Sensing Multimodal Large Language Models (RSMLLMs). If the UHR image is resized to a standard input image size, the extensive spatial and contextual information that UHR images contain will be neglected. Otherwise, the original size of these images often exceeds the token limits of standard RSMLLMs, making it difficult to process the entire image and capture long-range dependencies to answer the query based on the abundant visual context. In this paper, we introduce ImageRAG for RS, a training-free framework to address the complexities of analyzing UHR remote sensing imagery. By transforming the UHR remote sensing image analysis task into an image long-context selection task, we design an innovative image contextual retrieval mechanism based on the Retrieval-Augmented Generation (RAG) technique, denoted as ImageRAG. ImageRAG's core innovation lies in its ability to selectively retrieve and focus on the most relevant portions of the UHR image as visual contexts that pertain to a given query. A fast path and a slow path are proposed in this framework to handle this task efficiently and effectively. ImageRAG allows RSMLLMs to manage extensive context and spatial information from UHR RSI, ensuring the analysis is both accurate and efficient. The codebase will be released at https://github.com/om-ai-lab/ImageRAG

摘要

超高分辨率（UHR）遥感影像（RSI）（例如100,000 $\times$ 100,000像素或更大）对当前遥感多模态大语言模型（RSMLLMs）提出了重大挑战。若选择将UHR图像调整为标准输入尺寸，其包含的丰富空间和上下文信息将被忽略；而保持原始尺寸则通常超出标准RSMLLMs的令牌限制，导致难以处理完整图像并捕获长程依赖关系以基于充足视觉上下文回答问题。本文提出面向RS的ImageRAG框架，这是一种无需训练的解决方案，用于应对UHR遥感影像分析的复杂性。通过将UHR遥感图像分析任务转化为图像长上下文选择任务，我们基于检索增强生成（RAG）技术设计了一种创新的图像上下文检索机制，称为ImageRAG。其核心创新在于能够选择性检索并聚焦于UHR图像中与给定查询最相关的视觉上下文部分。该框架提出快速路径与慢速路径以高效精准处理此任务。ImageRAG使RSMLLMs能够管理UHR RSI中的海量上下文与空间信息，确保分析既准确又高效。代码库将发布于https://github.com/om-ai-lab/ImageRAG

DateLogicQA: Benchmarking Temporal Biases in Large Language Models

Abstract

arXiv:2412.13377v2 Announce Type: replace-cross Abstract: This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately.

Enhancing LLMs for Power System Simulations: A Feedback-driven Multi-agent Framework

Abstract

arXiv:2411.16707v3 Announce Type: replace-cross Abstract: The integration of experimental technologies with large language models (LLMs) is transforming scientific research. It positions AI as a versatile research assistant rather than a mere problem-solving tool. In the field of power systems, however, managing simulations -- one of the essential experimental technologies -- remains a challenge for LLMs due to their limited domain-specific knowledge, restricted reasoning capabilities, and imprecise handling of simulation parameters. To address these limitations, this paper proposes a feedback-driven, multi-agent framework. It incorporates three proposed modules: an enhanced retrieval-augmented generation (RAG) module, an improved reasoning module, and a dynamic environmental acting module with an error-feedback mechanism. Validated on 69 diverse tasks from Daline and MATPOWER, this framework achieves success rates of 93.13% and 96.85%, respectively. It significantly outperforms ChatGPT 4o, o1-preview, and the fine-tuned GPT-4o, which all achieved a success rate lower than 30% on complex tasks. Additionally, the proposed framework also supports rapid, cost-effective task execution, completing each simulation in approximately 30 seconds at an average cost of 0.014 USD for tokens. Overall, this adaptable framework lays a foundation for developing intelligent LLM-based assistants for human researchers, facilitating power system research and beyond.

摘要

实验技术与大语言模型（LLM）的融合正在重塑科学研究范式，使人工智能转变为多功能研究助手而非单纯的问题解决工具。然而在电力系统领域，LLM对核心实验技术——仿真的管理仍面临挑战，这源于其领域知识局限、推理能力受限以及对仿真参数处理不够精确。为突破这些限制，本文提出一种反馈驱动的多智能体框架，整合了三个创新模块：增强型检索增强生成（RAG）模块、改进的推理模块，以及具备错误反馈机制的动态环境执行模块。在Daline和MATPOWER的69项多样化任务测试中，该框架分别达到93.13%和96.85%的成功率，显著优于ChatGPT 4o、o1-preview和微调版GPT-4o——这些模型在复杂任务中的成功率均低于30%。此外，该框架支持快速低成本的任务执行，每次仿真平均耗时约30秒，令牌成本仅0.014美元。总体而言，这一自适应框架为开发基于LLM的智能研究助手奠定了基础，将推动电力系统及其他领域的科研发展。

JetFormer: An Autoregressive Generative Model of Raw Images and Text

Abstract

arXiv:2411.19722v2 Announce Type: replace-cross Abstract: Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer - JetFormer - which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is jointly trained with an autoregressive multimodal transformer. The normalizing flow model serves as both an image encoder for perception tasks and an image decoder for image generation tasks during inference. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model that is capable of generating high-fidelity images and producing strong log-likelihood bounds.

摘要

消除建模约束并统一跨领域架构是近期大规模多模态模型训练取得进展的关键驱动力。然而，大多数此类模型仍依赖多个独立训练的组件，例如特定模态的编码器和解码器。本研究进一步简化了图像与文本的联合生成建模。我们提出了一种自回归纯解码器Transformer架构——JetFormer——该模型通过直接最大化原始数据的似然进行训练，不依赖任何独立预训练组件，并能同时理解与生成文本和图像。具体而言，我们利用归一化流模型获取软标记图像表示，该表示与自回归多模态Transformer进行联合训练。该归一化流模型在推理过程中既作为感知任务的图像编码器，也充当图像生成任务的解码器。JetFormer实现的文本到图像生成质量可与基于VQ-VAE和VAE的最新基线模型相媲美，而这些基线模型均依赖采用复杂混合损失（包括感知损失）预训练的图像自编码器。同时，JetFormer展现出强大的图像理解能力。据我们所知，JetFormer是首个既能生成高保真图像又能产生强对数似然界的模型。

Training-Free Bayesianization for Low-Rank Adapters of Large Language Models

Abstract

arXiv:2412.05723v2 Announce Type: replace-cross Abstract: Estimating the uncertainty of responses from Large Language Models (LLMs) remains a critical challenge. While recent Bayesian methods have demonstrated effectiveness in quantifying uncertainty through low-rank weight updates, they typically require complex fine-tuning or post-training procedures. In this paper, we propose Training-Free Bayesianization (TFB), a simple yet theoretically grounded framework that efficiently transforms trained low-rank adapters into Bayesian ones without additional training. TFB systematically searches for the maximally acceptable level of variance in the weight posterior, constrained within a family of low-rank isotropic Gaussian distributions. Our theoretical analysis shows that under mild conditions, this search process is equivalent to KL-regularized variational optimization, a generalized form of variational inference. Through comprehensive experiments, we show that TFB achieves superior uncertainty estimation and generalization compared to existing methods while eliminating the need for complex Bayesianization training procedures. Code will be available at https://github.com/Wang-ML-Lab/bayesian-peft.

摘要

评估大语言模型（LLMs）响应的不确定性仍是一个关键挑战。尽管近期贝叶斯方法通过低秩权重更新在量化不确定性方面展现出有效性，但这些方法通常需要复杂的微调或训练后处理。本文提出免训练贝叶斯化（TFB）框架，这是一种简单但理论完备的方法，无需额外训练即可高效地将已训练的低秩适配器转化为贝叶斯版本。TFB系统性地搜索权重后验中最大可接受的方差水平，并将其约束在低秩各向同性高斯分布族内。理论分析表明，在温和条件下，该搜索过程等价于KL正则化的变分优化——这是变分推断的广义形式。通过全面实验，我们证明TFB在实现更优不确定性估计和泛化性能的同时，消除了复杂贝叶斯化训练流程的需求。代码将在https://github.com/Wang-ML-Lab/bayesian-peft发布。

VLSBench: Unveiling Visual Leakage in Multimodal Safety

Abstract

arXiv:2411.19939v3 Announce Type: replace-cross Abstract: Safety concerns about multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counterintuitive phenomenon: using textual unlearning to align MLLMs achieves safety performance comparable to that of MLLMs aligned with image-text pairs. To explain this phenomenon, we discover a Visual Safety Information Leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky content in the image has been revealed in the textual query. Thus, MLLMs can easily refuse these sensitive image-text pairs according to textual queries only, leading to unreliable cross-modality safety evaluation of MLLMs. We also conduct a further comparison experiment between textual alignment and multimodal alignment to highlight this drawback. To this end, we construct the multimodal Visual Leakless Safety Bench (VLSBench) with 2.2k image-text pairs through an automated data pipeline. Experimental results indicate that VLSBench poses a significant challenge to both open-source and closed-source MLLMs, e.g., LLaVA, Qwen2-VL and GPT-4o. Besides, we empirically compare textual and multimodal alignment methods on VLSBench and find that textual alignment is effective enough for multimodal safety scenarios with VSIL, while multimodal alignment is preferable for safety scenarios without VSIL. Code and data are released at https://github.com/AI45Lab/VLSBench

摘要

多模态大语言模型（MLLMs）的安全问题逐渐成为各类应用中的重要课题。令人惊讶的是，先前研究表明了一种反直觉现象：仅通过文本遗忘对齐的MLLMs，其安全性能与基于图文对对齐的MLLMs相当。为解释这一现象，我们发现现有多模态安全基准中存在视觉安全信息泄露（VSIL）问题，即图像中的潜在风险内容已在文本查询中暴露。因此，MLLMs仅需根据文本查询即可拒绝这些敏感图文对，导致跨模态安全评估的可靠性存疑。我们进一步通过文本对齐与多模态对齐的对比实验验证了这一缺陷。为此，我们通过自动化数据流程构建了包含2.2k图文对的无视觉泄露安全基准（VLSBench）。实验结果表明，VLSBench对开源及闭源MLLMs（如LLaVA、Qwen2-VL和GPT-4o）均构成显著挑战。此外，我们在VLSBench上实证比较了文本与多模态对齐方法，发现文本对齐在存在VSIL的多模态安全场景中已足够有效，而无VSIL场景下多模态对齐更具优势。代码与数据发布于https://github.com/AI45Lab/VLSBench。

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

Abstract

arXiv:2412.06141v2 Announce Type: replace-cross Abstract: The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have inadequately mitigated clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. To address this challenge, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, and integrate these scores into the preference optimization process as weights, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving substantial improvements over existing preference optimization methods, averaging 14.2% and 51.7% on the Med-VQA and report generation tasks, respectively. Our code is available at https://github.com/aiming-lab/MMedPO.

摘要

大型视觉语言模型（LVLM）的发展推动了其在医学领域的应用。然而，医学LVLM（Med-LVLM）因模态失准面临事实性挑战，即模型优先依赖文本知识而非视觉输入，导致生成与医学图像信息矛盾的幻觉内容。先前通过偏好优化增强Med-LVLM模态对齐的尝试未能充分抑制偏好数据中的临床相关性，使得这些样本易于区分并降低对齐效果。为解决这一问题，我们提出MMedPO——一种考虑偏好样本临床相关性的新型多模态医学偏好优化方法，以提升Med-LVLM的对齐能力。MMedPO通过引入两类非偏好数据构建多模态偏好数据集：（1）通过目标Med-LVLM或GPT-4o注入医学不准确的合理幻觉响应；（2）通过局部病灶噪声干扰实现关键区域视觉理解的病灶忽视。随后基于多个Med-LLM和视觉工具的评分计算各样本临床相关性，并将评分作为权重融入偏好优化过程以实现有效对齐。实验表明，MMedPO显著提升Med-LVLM的事实准确性，在医学视觉问答和报告生成任务中分别平均领先现有偏好优化方法14.2%和51.7%。代码详见https://github.com/aiming-lab/MMedPO。
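
One way to picture "scores integrated into the preference optimization process as weights" is a per-sample weighted pairwise loss. The sketch below assumes a DPO-style logistic loss, which the abstract does not specify; the function name, the `beta` default, and the meaning of `margin` (the policy-versus-reference log-ratio of the preferred response minus that of the dispreferred one) are illustrative assumptions rather than the authors' formulation.

```python
import math


def weighted_preference_loss(margin: float, weight: float, beta: float = 0.1) -> float:
    """Per-sample loss: weight * -log(sigmoid(beta * margin)).

    margin: assumed log-ratio gap between preferred and dispreferred responses.
    weight: assumed clinical-relevance score for this preference pair.
    """
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return weight * math.log1p(math.exp(-beta * margin))


# A pair judged clinically relevant (weight 0.9) versus one judged marginal (weight 0.2).
relevant_pair = weighted_preference_loss(margin=-2.0, weight=0.9)
marginal_pair = weighted_preference_loss(margin=-2.0, weight=0.2)
```

Under this assumed form, two pairs with the same margin contribute to the gradient in proportion to their weights, so clinically relevant pairs dominate the update.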

AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding

Abstract

arXiv:2501.12162v2 Announce Type: replace-cross Abstract: Modern large language model (LLM) applications exhibit diverse service-level objectives (SLOs), from low-latency requirements in interactive coding assistants to more relaxed constraints in data wrangling tasks. Existing LLM serving systems, which rely on uniform batching and scheduling strategies, often fail to meet these heterogeneous SLOs concurrently. We present AdaServe, the first LLM serving system designed to support efficient multi-SLO serving through SLO-customized speculative decoding. AdaServe formulates multi-SLO serving as a constrained optimization problem and introduces a hardware-aware algorithm that constructs a speculation tree tailored to each request's latency target. It features a speculate-select-verify pipeline that enables fine-grained control over decoding speed while maximizing system throughput. AdaServe further adapts to workload variation by dynamically adjusting speculation parameters. Evaluations across diverse workloads show that AdaServe reduces SLO violations by up to 4.3 $\times$ and improves goodput by up to 1.9 $\times$ compared to the best performing baselines, highlighting its effectiveness in multi-SLO serving.

摘要

现代大型语言模型（LLM）应用展现出多样化的服务级别目标（SLO），从交互式编程助手的低延迟需求到数据整理任务中更为宽松的约束条件。现有依赖统一批处理和调度策略的LLM服务系统往往无法同时满足这些异构SLO需求。我们提出AdaServe——首个通过SLO定制化推测解码支持高效多SLO服务的LLM服务系统。AdaServe将多SLO服务建模为约束优化问题，并提出一种硬件感知算法，为每个请求的延迟目标构建定制化的推测树。该系统采用推测-选择-验证流水线设计，在最大化系统吞吐量的同时实现对解码速度的细粒度控制。AdaServe还能通过动态调整推测参数适应工作负载变化。多样化工作负载的评估表明，与性能最佳的基线相比，AdaServe将SLO违规减少达4.3倍，并将优质吞吐量提升达1.9倍，凸显了其在多SLO服务中的卓越效能。

Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges

Abstract

arXiv:2501.11496v2 Announce Type: replace-cross Abstract: The global crisis of language endangerment meets a technological turning point as Generative AI (GenAI) and Large Language Models (LLMs) unlock new frontiers in automating corpus creation, transcription, translation, and tutoring. However, this promise is imperiled by fragmented practices and the critical lack of a methodology to navigate the fraught balance between LLM capabilities and the profound risks of data scarcity, cultural misappropriation, and ethical missteps. This paper introduces a novel analytical framework that systematically evaluates GenAI applications against language-specific needs, embedding community governance and ethical safeguards as foundational pillars. We demonstrate its efficacy through the Te Reo Māori revitalization, where it illuminates successes, such as community-led Automatic Speech Recognition achieving 92% accuracy, while critically surfacing persistent challenges in data sovereignty and model bias for digital archives and educational tools. Our findings underscore that GenAI can indeed revolutionize language preservation, but only when interventions are rigorously anchored in community-centric data stewardship, continuous evaluation, and transparent risk management. Ultimately, this framework provides an indispensable toolkit for researchers, language communities, and policymakers, aiming to catalyze the ethical and high-impact deployment of LLMs to safeguard the world's linguistic heritage.

摘要

全球语言濒危危机正迎来技术转折点——生成式人工智能（GenAI）与大型语言模型（LLMs）为语料库自动化构建、转录、翻译及教学开辟了新路径。然而这种潜力正面临实践碎片化与关键方法论缺失的双重威胁，亟需在LLM能力与数据稀缺、文化挪用、伦理失范等重大风险间建立平衡机制。本文提出一种新型分析框架，通过系统评估GenAI应用与语言特定需求的匹配度，将社区治理与伦理保障嵌入基础架构。我们以毛利语（Te Reo Māori）复兴计划为实证案例，揭示该框架如何有效识别社区主导的自动语音识别系统实现92%准确率等成功实践，同时尖锐指出数字档案与教育工具中持续存在的数据主权与模型偏差问题。研究结果表明：只有当干预措施严格遵循以社区为核心的数据管理、持续评估与透明风险管理原则时，GenAI才能真正革新语言保护领域。该框架为研究者、语言社群及政策制定者提供了关键工具包，旨在推动LLMs以符合伦理且高效的方式守护世界语言遗产。

Learning to Learn Weight Generation via Local Consistency Diffusion

Abstract

arXiv:2502.01117v3 Announce Type: replace-cross Abstract: Diffusion-based algorithms have emerged as promising techniques for weight generation. However, existing solutions are limited by two challenges: generalizability and local target assignment. The former arises from the inherent lack of cross-task transferability in existing single-level optimization methods, limiting the model's performance on new tasks. The latter lies in existing research modeling only global optimal weights, neglecting the supervision signals in local target weights. Moreover, naively assigning local target weights causes local-global inconsistency. To address these issues, we propose Mc-Di, which integrates the diffusion algorithm with meta-learning for better generalizability. Furthermore, we extend the vanilla diffusion into a local consistency diffusion algorithm. Our theory and experiments demonstrate that it can learn from local targets while maintaining consistency with the global optima. We validate Mc-Di's superior accuracy and inference efficiency in tasks that require frequent weight updates, including transfer learning, few-shot learning, domain generalization, and large language model adaptation.

摘要

基于扩散的算法已成为权重生成领域颇具前景的技术。然而现有解决方案面临两大挑战：通用性与局部目标分配。前者源于现有单层级优化方法固有的跨任务迁移能力不足，限制了模型在新任务上的表现；后者体现在现有研究仅建模全局最优权重，忽视了局部目标权重中的监督信号。此外，简单分配局部目标权重会导致局部-全局不一致性。为解决这些问题，我们提出Mc-Di算法，将扩散算法与元学习相结合以提升通用性。进一步，我们将基础扩散算法扩展为局部一致性扩散算法。理论与实验表明，该方法能在保持与全局最优一致性的同时学习局部目标。在需要频繁更新权重的任务（包括迁移学习、小样本学习、领域泛化和大语言模型适配）中，Mc-Di展现出卓越的准确性和推理效率优势。

Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models

Abstract

arXiv:2501.18280v3 Announce Type: replace-cross Abstract: The security issue of large language models (LLMs) has gained wide attention recently, with various defense mechanisms developed to prevent harmful output, among which safeguards based on text embedding models serve as a fundamental defense. Through testing, we discover that the output distribution of text embedding models is severely biased with a large mean. Inspired by this observation, we propose novel, efficient methods to search for universal magic words that attack text embedding models. Universal magic words as suffixes can shift the embedding of any text towards the bias direction, thus manipulating the similarity of any text pair and misleading safeguards. Attackers can jailbreak the safeguards by appending magic words to user prompts and requiring LLMs to end answers with magic words. Experiments show that magic word attacks significantly degrade safeguard performance on JailbreakBench, cause real-world chatbots to produce harmful outputs in full-pipeline attacks, and generalize across input/output texts, models, and languages. To eradicate this security risk, we also propose defense methods against such attacks, which can correct the bias of text embeddings and improve downstream performance in a train-free manner.

摘要

大型语言模型（LLMs）的安全问题近期受到广泛关注，各类防御机制被开发用于阻止有害输出，其中基于文本嵌入模型的保障措施构成基础防线。通过测试，我们发现文本嵌入模型的输出分布存在严重偏差且均值较大。受此现象启发，我们提出新颖高效的方法来搜索攻击文本嵌入模型的通用魔咒词。作为后缀的通用魔咒词可将任意文本的嵌入向量偏移至偏差方向，从而操纵任意文本对的相似度并误导保障系统。攻击者可通过在用户提示词后添加魔咒词，并要求LLMs以魔咒词结束回答来实现保障系统的越狱。实验表明，魔咒词攻击显著降低了JailbreakBench上的保障性能，使现实场景中的聊天机器人在全流程攻击中生成有害输出，并具备跨输入/输出文本、模型和语言的泛化能力。为消除此安全隐患，我们还提出了针对此类攻击的防御方法，能够以无需训练的方式修正文本嵌入偏差并提升下游性能。

SwiftPrune: Hessian-Free Weight Pruning for Large Language Models

Abstract

arXiv:2501.16376v2 Announce Type: replace-cross Abstract: Post-training pruning, as one of the key techniques for compressing large language models, plays a vital role in lightweight model deployment and model sparsity. However, current mainstream pruning methods that depend on the Hessian matrix face significant limitations in both pruning speed and practical effectiveness due to the computationally intensive nature of second-order derivative calculations. This paper presents SwiftPrune, a novel Hessian-free weight pruning method that achieves hardware-efficient model compression through two key innovations: 1) SwiftPrune eliminates the need for computationally intensive Hessian matrix calculations by introducing a contribution-based weight metric, which evaluates the importance of weights without relying on second-order derivatives. 2) We employ the Exponentially Weighted Moving Average (EWMA) technique to bypass weight sorting, enabling the selection of weights that contribute most to LLM accuracy and further reducing time complexity. Our approach is extended to support structured sparsity pruning, facilitating efficient execution on modern hardware accelerators. We validate SwiftPrune on three LLMs (namely LLaMA2, LLaMA3, and Pythia), demonstrating that it significantly enhances compression performance. The experimental findings reveal that SwiftPrune completes the pruning process within seconds, achieving an average speedup of 12.29x (up to 56.02x) over existing SOTA approaches.

摘要

训练后剪枝作为压缩大型语言模型的关键技术之一，在轻量化模型部署和模型稀疏化中发挥着至关重要的作用。然而，当前主流的基于Hessian矩阵的剪枝方法由于二阶导数计算的高计算复杂度，在剪枝速度和实际效果上都面临显著局限。本文提出SwiftPrune，一种无需Hessian矩阵的新型权重剪枝方法，通过两项关键创新实现硬件高效的模型压缩：1）通过引入基于贡献度的权重度量指标，在不依赖二阶导数的情况下评估权重重要性，从而消除计算密集型Hessian矩阵的需求；2）采用指数加权移动平均（EWMA）技术绕过权重排序步骤，直接选择对模型精度贡献最大的权重，进一步降低时间复杂度。本方法可扩展支持结构化稀疏剪枝，适配现代硬件加速器的高效执行。我们在LLaMA2、LLaMA3和Pythia三个大型语言模型上验证SwiftPrune，证明其显著提升了压缩性能。实验结果表明，SwiftPrune能在数秒内完成剪枝过程，相比现有SOTA方法平均加速12.29倍（最高达56.02倍）。

Joint Localization and Activation Editing for Low-Resource Fine-Tuning

Abstract

arXiv:2502.01179v3 Announce Type: replace-cross Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing (or steering) techniques, which modify the activations of specific model components. These methods, due to their extremely small parameter counts, show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves, namely the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods. The code for the method is released at https://github.com/wenlai-lavine/jola.

摘要

参数高效微调方法（如LoRA）常被用于适配大语言模型，然而标准方法在仅含数百样本的低资源场景中效果有限。近期可解释性研究的进展催生了激活编辑（或称导向）技术，该技术通过修改特定模型组件的激活状态实现干预。此类方法因参数量极少，在小数据集上展现出潜力，但其性能高度依赖于正确识别待编辑模块，且在不同数据集间常缺乏稳定性。本文提出联合定位与激活编辑方法（JoLA），可同步学习：（1）Transformer中需编辑的注意力头位置；（2）干预类型（加性、乘性或混合）；（3）干预参数本身（作用于头输出的加性偏移向量或乘性缩放向量）。通过在常识推理、自然语言理解和生成三大基准测试上的评估，我们证明JoLA始终优于现有方法。代码已发布于https://github.com/wenlai-lavine/jola。
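
The three things JoLA is described as learning (which heads, which intervention type, and the intervention vectors) can be pictured with a small sketch of how such an edit would be applied once learned. The code below is only an assumed, simplified form: it applies an elementwise scaling and/or offset to the outputs of selected heads, and it does not show how JoLA learns the selection or the vectors. All names and values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def edit_head_output(
    head_output: Vector,
    scale: Optional[Vector] = None,
    offset: Optional[Vector] = None,
) -> Vector:
    """Apply an optional multiplicative scaling, then an optional additive offset."""
    edited = list(head_output)
    if scale is not None:
        edited = [h * s for h, s in zip(edited, scale)]
    if offset is not None:
        edited = [h + a for h, a in zip(edited, offset)]
    return edited


def edit_selected_heads(
    head_outputs: List[Vector],
    edits: Dict[int, Tuple[Optional[Vector], Optional[Vector]]],
) -> List[Vector]:
    """Edit only the heads listed in `edits`; leave every other head unchanged."""
    return [
        edit_head_output(out, *edits[i]) if i in edits else list(out)
        for i, out in enumerate(head_outputs)
    ]


# Two toy heads; only head 0 is selected, with both scaling and offset.
heads = [[1.0, -2.0, 0.5], [3.0, 3.0, 3.0]]
edits = {0: ([2.0, 1.0, 0.0], [0.5, 0.5, 0.5])}
edited_heads = edit_selected_heads(heads, edits)
```

Passing `None` for the scale or the offset gives a purely additive or purely multiplicative edit, which mirrors the "additive, multiplicative, or both" choice in the abstract.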

Option-ID Based Elimination For Multiple Choice Questions

Abstract

arXiv:2501.15175v3 Announce Type: replace-cross Abstract: Multiple choice questions (MCQs) are a popular and important task for evaluating large language models (LLMs). Based on common strategies people use when answering MCQs, the process of elimination (PoE) has been proposed as an effective problem-solving method. Existing PoE methods typically either have LLMs directly identify incorrect options or score options and replace lower-scoring ones with [MASK]. However, both methods suffer from inapplicability or suboptimal performance. To address these issues, this paper proposes a novel option-ID based PoE ($\text{PoE}_{\text{ID}}$). $\text{PoE}_{\text{ID}}$ critically incorporates a debiasing technique to counteract LLMs' token bias, enhancing robustness over naive ID-based elimination. It features two strategies: $\text{PoE}_{\text{ID}}^{\text{log}}$, which eliminates options whose IDs have log probabilities below the average threshold, and $\text{PoE}_{\text{ID}}^{\text{seq}}$, which iteratively removes the option with the lowest ID probability. We conduct extensive experiments with 6 different LLMs on 4 diverse datasets. The results demonstrate that $\text{PoE}_{\text{ID}}$, especially $\text{PoE}_{\text{ID}}^{\text{log}}$, significantly improves zero-shot and few-shot MCQ performance, particularly in datasets with more options. Our analyses demonstrate that $\text{PoE}_{\text{ID}}^{\text{log}}$ enhances the LLMs' confidence in selecting the correct option, and the option elimination strategy outperforms methods relying on [MASK] replacement. We further investigate the limitations of LLMs in directly identifying incorrect options, which stem from their inherent deficiencies.

摘要

多项选择题（MCQ）是评估大语言模型（LLM）性能的重要任务。基于人类解答MCQ的常见策略，排除法（PoE）被提出作为一种有效的问题解决方法。现有PoE方法通常让LLM直接识别错误选项，或对选项评分并用[MASK]替换低分选项，但这两类方法存在适用性不足或性能欠佳的问题。针对这些缺陷，本文提出了一种基于选项ID的新型排除法（ $\text{PoE}_{\text{ID}}$ ）。该方法创新性地引入去偏技术以抵消LLM的标记偏差，相比朴素基于ID的排除法具有更强鲁棒性。其包含两种策略： $\text{PoE}_{\text{ID}}^{\text{log}}$ （剔除ID对数概率低于平均阈值的选项）和 $\text{PoE}_{\text{ID}}^{\text{seq}}$ （迭代移除ID概率最低的选项）。我们在4个多样化数据集上对6种不同LLM进行了广泛实验。结果表明 $\text{PoE}_{\text{ID}}$ （尤其是 $\text{PoE}_{\text{ID}}^{\text{log}}$ ）能显著提升零样本和小样本MCQ性能，在选项较多的数据集中效果尤为突出。分析显示 $\text{PoE}_{\text{ID}}^{\text{log}}$ 能增强LLM选择正确选项的置信度，且其选项排除策略优于依赖[MASK]替换的方法。我们进一步探究了LLM在直接识别错误选项时的局限性，发现其源于模型固有的缺陷。
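
The two elimination rules named in the abstract can be sketched as below. This is a simplified reading: the debiasing step is omitted because the abstract does not describe it, and the `score` callable, which stands in for querying a model for option-ID log probabilities, is hypothetical, as are the example values.

```python
from typing import Callable, Dict, List


def eliminate_below_average(id_logprobs: Dict[str, float]) -> List[str]:
    """Keep the option IDs whose log probability is at least the average."""
    average = sum(id_logprobs.values()) / len(id_logprobs)
    return [opt for opt, lp in id_logprobs.items() if lp >= average]


def eliminate_sequentially(
    options: List[str],
    score: Callable[[List[str]], Dict[str, float]],
    keep: int = 1,
) -> List[str]:
    """Repeatedly re-score the remaining options and drop the least likely one."""
    remaining = list(options)
    while len(remaining) > keep:
        logprobs = score(remaining)
        worst = min(remaining, key=lambda opt: logprobs[opt])
        remaining.remove(worst)
    return remaining


# Hypothetical log probabilities assigned to each option ID.
example = {"A": -0.5, "B": -3.0, "C": -1.0, "D": -3.5}
kept_by_average = eliminate_below_average(example)


def fixed_score(opts: List[str]) -> Dict[str, float]:
    """Stand-in scorer that returns the fixed example values."""
    return {opt: example[opt] for opt in opts}


kept_sequentially = eliminate_sequentially(list(example), fixed_score, keep=2)
```

Here the average log probability is -2.0, so options B and D are eliminated under the first rule; the sequential rule with `keep=2` removes D and then B.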

DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

Abstract

arXiv:2502.00270v2 Announce Type: replace-cross Abstract: The performance of an LLM depends heavily on the relevance of its training data to the downstream evaluation task. However, in practice, the data involved in an unseen evaluation task is often unknown (e.g., conversations between an LLM and a user are end-to-end encrypted). Hence, it is unclear what data are relevant for fine-tuning the LLM to maximize its performance on the specific unseen evaluation task. Instead, one can only deploy the LLM on the unseen task to gather multiple rounds of feedback on how well the model performs (e.g., user ratings). This novel setting offers a refreshing perspective towards optimizing training data mixtures via feedback from an unseen evaluation task, which prior data mixing and selection works do not consider. Our paper presents DUET, a novel global-to-local algorithm that interleaves influence function as a data selection method with Bayesian optimization to optimize data mixture via feedback from a specific unseen evaluation task. By analyzing DUET's cumulative regret, we theoretically show that DUET converges to the optimal training data mixture for an unseen task even without any data knowledge of the task. Finally, our experiments across a variety of language tasks demonstrate that DUET outperforms existing data selection and mixing methods in the unseen-task setting.

摘要

大型语言模型（LLM）的性能在很大程度上取决于其训练数据与下游评估任务的相关性。然而在实际应用中，未知评估任务所涉及的数据往往不可见（例如LLM与用户之间的对话采用端到端加密）。因此，如何选择相关数据对LLM进行微调以使其在该特定未知任务上达到最佳性能，目前尚不明确。唯一可行的方法是将LLM部署于该未知任务，通过多轮反馈（如用户评分）来评估模型表现。这一新颖场景为通过未知评估任务的反馈优化训练数据混合比例提供了全新视角，此前的数据混合与选择研究均未涉及该问题。本文提出DUET算法，该创新性全局-局部交替算法将影响函数作为数据选择方法与贝叶斯优化相结合，通过特定未知评估任务的反馈来优化数据混合比例。通过分析DUET的累积遗憾，我们从理论上证明即使完全不了解任务数据，DUET仍能收敛至未知任务的最优训练数据混合方案。最后，我们在多种语言任务上的实验表明，在未知任务场景下DUET的性能优于现有数据选择与混合方法。

'Do as I say not as I do': A Semi-Automated Approach for Jailbreak Prompt Attack against Multimodal LLMs

Abstract

arXiv:2502.00735v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have seen widespread applications across various domains due to their growing ability to process diverse types of input data, including text, audio, image and video. While LLMs have demonstrated outstanding performance in understanding and generating contexts for different scenarios, they are vulnerable to prompt-based attacks, which are mostly delivered via text input. In this paper, we introduce the first voice-based jailbreak attack against multimodal LLMs, termed the Flanking Attack, which can direct different types of input at multimodal LLMs simultaneously. Our work is motivated by recent advancements in monolingual voice-driven large language models, which have introduced new attack surfaces beyond traditional text-based vulnerabilities for LLMs. To investigate these risks, we examine state-of-the-art multimodal LLMs, which can be accessed via different types of inputs such as audio input, focusing on how adversarial prompts can bypass their defense mechanisms. We propose a novel strategy in which the disallowed prompt is flanked by benign, narrative-driven prompts. It is integrated into the Flanking Attack, which attempts to humanize the interaction context and execute the attack through a fictional setting. Further, to better evaluate the attack performance, we present a semi-automated self-assessment framework for policy violation detection. We demonstrate that the Flanking Attack is capable of manipulating state-of-the-art LLMs into generating misaligned and forbidden outputs, achieving an average attack success rate ranging from 0.67 to 0.93 across seven forbidden scenarios.

摘要

大语言模型（LLMs）因其处理多样化输入数据（包括文本、音频、图像和视频）的能力不断增强，已在多个领域得到广泛应用。尽管LLMs在不同场景下的上下文理解和生成方面表现出色，但它们易受基于提示的攻击，这些攻击主要通过文本输入实现。本文首次提出针对多模态LLMs的基于语音的越狱攻击，称为侧翼攻击（Flanking Attack），该攻击能够同时处理多种类型的输入以针对多模态LLMs。我们的研究灵感来源于近期单语语音驱动大语言模型的进展，这些模型为LLMs引入了超越传统文本漏洞的新攻击面。为探究这些风险，我们研究了最先进的多模态LLMs（可通过音频输入等多种方式访问），重点关注对抗性提示如何绕过其防御机制。我们提出了一种新颖策略，即将被禁止的提示置于良性的、叙事驱动的提示之间，并将其整合到侧翼攻击中，该攻击试图通过虚构场景人性化交互上下文并执行攻击。此外，为更好地评估攻击效果，我们提出了一种半自动化的自我评估框架用于检测策略违规。实验表明，侧翼攻击能够操纵最先进的LLMs生成不符合要求或被禁止的输出，在七种禁止场景中平均攻击成功率达到0.67至0.93。

Is LLM an Overconfident Judge? Unveiling the Capabilities of LLMs in Detecting Offensive Language with Annotation Disagreement

Abstract

arXiv:2502.06207v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have become essential for offensive language detection, yet their ability to handle annotation disagreement remains underexplored. Disagreement samples, which arise from subjective interpretations, pose a unique challenge due to their ambiguous nature. Understanding how LLMs process these cases, particularly their confidence levels, can offer insight into their alignment with human annotators. This study systematically evaluates the performance of multiple LLMs in detecting offensive language at varying levels of annotation agreement. We analyze binary classification accuracy, examine the relationship between model confidence and human disagreement, and explore how disagreement samples influence model decision-making during few-shot learning and instruction fine-tuning. Our findings reveal that LLMs struggle with low-agreement samples, often exhibiting overconfidence in these ambiguous cases. However, utilizing disagreement samples in training improves both detection accuracy and model alignment with human judgment. These insights provide a foundation for enhancing LLM-based offensive language detection in real-world moderation tasks.

摘要

大型语言模型（LLMs）在冒犯性语言检测中已成为关键工具，但其处理标注分歧的能力仍待深入探究。由主观解释产生的分歧样本因其模糊性构成了独特挑战。理解LLMs如何处理这些案例（尤其是其置信水平）可揭示其与人类标注者的契合程度。本研究系统评估了多个LLMs在不同标注一致性水平下的冒犯性语言检测表现。我们分析了二元分类准确率，检验了模型置信度与人类分歧之间的关系，并探讨了分歧样本在小样本学习和指令微调过程中如何影响模型决策。研究发现：LLMs在低一致性样本上表现欠佳，且常对这些模糊案例表现出过度自信；但将分歧样本纳入训练能同时提升检测准确率和模型与人类判断的契合度。这些发现为增强现实场景内容审核中基于LLM的冒犯性语言检测奠定了理论基础。

ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data

Abstract

arXiv:2502.05567v2 Announce Type: replace-cross Abstract: Autoformalization, the automatic translation of mathematical content from natural language into machine-verifiable formal languages, has seen significant progress driven by advances in large language models (LLMs). Nonetheless, a primary barrier to further improvements is the limited availability of parallel corpora that map informal mathematical text to its formal counterpart. To address this limitation, we propose ATLAS (Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data), a novel data generation framework designed to produce large-scale, high-quality parallel corpora of theorem statements. Distinct from prior approaches, ATLAS begins with a concept repository, accelerates the improvement of the student model through expert iteration combined with knowledge distillation, and introduces two novel augmentation strategies that exploit the structural characteristics of formal languages. Running ATLAS for 10 iterations, we construct an undergraduate-level dataset comprising 117k theorem statements and develop ATLAS Translator, which demonstrates statistically significant improvements over both the HERALD Translator and the Kimina-Autoformalizer across all benchmarks ( $p<0.05$ , two-sided t-test), achieving a new state of the art. The datasets, model, and code will be released to the public soon.

摘要

自动形式化（将数学内容从自然语言自动翻译为机器可验证的形式语言）在大语言模型（LLMs）的推动下取得了显著进展。然而，制约性能进一步提升的主要障碍是非形式化数学文本与形式化表述之间的平行语料库稀缺。为解决这一局限，我们提出ATLAS（通过数据提升、增强与合成的定理自动形式化框架）——一种专为生成大规模高质量定理语句平行语料库设计的新型数据生成框架。与现有方法不同，ATLAS从概念知识库出发，通过专家迭代与知识蒸馏相结合的方式加速学生模型优化，并引入两种利用形式语言结构特征的新型增强策略。经过10轮迭代运行，我们构建了包含11.7万条定理语句的本科级数据集，并开发了ATLAS翻译器。该翻译器在所有基准测试中均显著超越HERALD翻译器和Kimina自动形式化系统（p<0.05，双尾t检验），创造了新的性能标杆。数据集、模型及代码即将公开发布。

Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models

Abstract

arXiv:2502.02444v4 Announce Type: replace-cross Abstract: Values are core drivers of individual and collective perception, cognition, and behavior. Value systems, such as Schwartz's Theory of Basic Human Values, delineate the hierarchy and interplay among these values, enabling cross-disciplinary investigations into decision-making and societal dynamics. Recently, the rise of Large Language Models (LLMs) has raised concerns regarding their elusive intrinsic values. Despite growing efforts in evaluating, understanding, and aligning LLM values, a psychologically grounded LLM value system remains underexplored. This study addresses the gap by introducing the Generative Psycho-Lexical Approach (GPLA), a scalable, adaptable, and theoretically informed method for constructing value systems. Leveraging GPLA, we propose a psychologically grounded five-factor value system tailored for LLMs. For systematic validation, we present three benchmarking tasks that integrate psychological principles with cutting-edge AI priorities. Our results reveal that the proposed value system meets standard psychological criteria, better captures LLM values, improves LLM safety prediction, and enhances LLM alignment, when compared to the canonical Schwartz's values.

摘要

价值观是个体与集体感知、认知及行为的核心驱动力。诸如施瓦茨人类基本价值观理论等价值体系，通过阐明价值观间的层级关系与相互作用，为跨学科研究决策机制与社会动态提供了框架。近年来，大型语言模型（LLMs）的兴起引发了对其内在价值观隐忧的关注。尽管针对LLM价值观的评估、理解与对齐研究日益增多，但基于心理学理论构建的LLM价值体系仍存在研究空白。本研究提出生成式心理词汇分析法（GPLA），该方法兼具可扩展性、适应性和理论依据，旨在填补这一空白。基于GPLA，我们构建了适用于LLMs的心理学五因素价值体系。为系统验证，我们设计了三个融合心理学原理与前沿AI优先任务的基准测试。结果表明：相较于经典的施瓦茨价值观体系，本研究提出的价值体系不仅符合标准心理学准则，更能有效捕捉LLM价值观特征，提升LLM安全性预测能力，并强化LLM对齐效果。

KL Penalty Control via Perturbation for Direct Preference Optimization

Abstract

arXiv:2502.13177v2 Announce Type: replace-cross Abstract: Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that the KL penalty, which prevents excessive deviation from the reference model, is static throughout the training process. Several methods claim to change this static KL penalty of DPO into a dynamic one, but no approach can adaptively assign different KL penalties for each preference pair. In this paper, we propose $\varepsilon$ -Direct Preference Optimization ( $\varepsilon$ -DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair. Specifically, $\varepsilon$ -DPO adaptively controls $\beta$ for each preference pair based on the monotonicity of logits as a preference model under the perturbation of $\beta$ during training. This is equivalent to adjusting the KL penalty by checking whether the change in training-time temperature can lead to better preference confidence as preference models by simply reusing the logit of the current policy and the reference policy. Experimental results show that the simple criterion of $\varepsilon$ -DPO for KL penalty relaxation significantly improves DPO compared to most existing direct alignment algorithms on general chatbot benchmarks and reveal that this KL penalty control criterion can reflect confusion as a preference model and provide an efficient KL trade-off, highlighting the significance of instance-level adaptive KL penalty control in DPO.

摘要

直接偏好优化（DPO）通过仅使用离线数据集实现大语言模型与人类偏好的对齐，展现出显著优势。然而该方法存在固有缺陷：其防止过度偏离参考模型的KL惩罚项在整个训练过程中保持静态。现有若干方法声称可将DPO的静态KL惩罚转变为动态形式，但均未能实现针对每个偏好对的自适应KL惩罚分配。本文提出ε-直接偏好优化（ε-DPO），能够自适应调控每个偏好对的KL惩罚强度β。具体而言，ε-DPO基于训练过程中β扰动下对数几率作为偏好模型的单调性，对每个偏好对实施β的自适应控制。这相当于通过检查训练时温度变化能否提升偏好置信度（仅需复用当前策略与参考策略的对数几率）来实现KL惩罚的动态调整。实验结果表明：在通用聊天机器人基准测试中，ε-DPO提出的KL惩罚松弛简单准则较多数现有直接对齐算法显著改进了DPO性能，同时揭示该KL惩罚控制准则能有效反映偏好模型的混淆程度并提供高效的KL权衡，凸显了实例级自适应KL惩罚控制在DPO中的重要性。
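
As a point of reference for the role of $\beta$, the following is a minimal sketch of the standard DPO loss with one $\beta$ value per preference pair. It is not the $\varepsilon$-DPO criterion itself: the abstract does not spell out how each pair's $\beta$ is chosen, so the per-pair values are simply supplied by the caller. The function names and the example log-probabilities are hypothetical.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_pair_loss(pol_w: float, pol_l: float,
                  ref_w: float, ref_l: float, beta: float) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * margin).

    pol_* / ref_* are sequence log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and the reference model.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return softplus(-beta * margin)

def dpo_batch_loss(pairs, betas) -> float:
    """Mean loss over pairs, each with its own beta (assumed given, not derived)."""
    losses = [dpo_pair_loss(*p, beta=b) for p, b in zip(pairs, betas)]
    return sum(losses) / len(losses)

# Illustrative log-probabilities only.
pairs = [(-1.0, -2.0, -1.5, -1.5), (-3.0, -2.5, -2.8, -2.6)]
loss = dpo_batch_loss(pairs, betas=[0.1, 0.05])
```

A smaller $\beta$ weakens the pull toward the reference model for that pair, which is the quantity an instance-level scheme would adjust.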

To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models

Abstract

arXiv:2502.12202v2 Announce Type: replace-cross Abstract: Large Reasoning Models (LRMs) are designed to solve complex tasks by generating explicit reasoning traces before producing final answers. However, we reveal a critical vulnerability in LRMs -- termed Unthinking Vulnerability -- wherein the thinking process can be bypassed by manipulating special delimiter tokens. It is empirically demonstrated to be widespread across mainstream LRMs, posing both a significant risk and potential utility, depending on how it is exploited. In this paper, we systematically investigate this vulnerability from both malicious and beneficial perspectives. On the malicious side, we introduce Breaking of Thought (BoT), a novel attack that enables adversaries to bypass the thinking process of LRMs, thereby compromising their reliability and availability. We present two variants of BoT: a training-based version that injects backdoor during the fine-tuning stage, and a training-free version based on adversarial attack during the inference stage. As a potential defense, we propose thinking recovery alignment to partially mitigate the vulnerability. On the beneficial side, we introduce Monitoring of Thought (MoT), a plug-and-play framework that allows model owners to enhance efficiency and safety. It is implemented by leveraging the same vulnerability to dynamically terminate redundant or risky reasoning through external monitoring. Extensive experiments show that BoT poses a significant threat to reasoning reliability, while MoT provides a practical solution for preventing overthinking and jailbreaking. Our findings expose an inherent flaw in current LRM architectures and underscore the need for more robust reasoning systems in the future.

摘要

大型推理模型（LRMs）旨在通过生成显式推理轨迹来解决复杂任务。然而，我们发现LRMs存在一个关键漏洞——称为"无意识漏洞"——即通过操纵特殊分隔符即可绕过其思考过程。实证研究表明该漏洞在主流LRMs中普遍存在，根据利用方式不同，可能构成重大风险或潜在价值。本文从恶意和有益两个角度系统研究了该漏洞。在恶意利用方面，我们提出"思维破坏"（BoT）攻击，使攻击者能绕过LRMs的思考过程，从而损害其可靠性和可用性。我们开发了BoT的两种变体：基于训练的方法通过在微调阶段注入后门，以及基于推理阶段对抗攻击的无训练方法。作为潜在防御方案，我们提出思维恢复对齐以部分缓解该漏洞。在有益利用方面，我们提出"思维监控"（MoT）框架，该即插即用方案允许模型所有者通过外部监控动态终止冗余或高风险推理来提升效率与安全性。大量实验表明：BoT对推理可靠性构成重大威胁，而MoT为防止过度思考和越狱提供了实用解决方案。本研究揭示了当前LRM架构的内在缺陷，强调未来需要构建更健壮的推理系统。

Exploring the Potential of Encoder-free Architectures in 3D LMMs

Abstract

arXiv:2502.09620v2 Announce Type: replace-cross Abstract: Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. We also present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.10%, 50.98%, and 43.10% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL

摘要

编码器无关架构在二维视觉领域已得到初步探索，但其能否有效应用于三维理解场景仍是一个开放性问题。本文首次全面研究了编码器无关架构在缓解基于编码器的三维大型多模态模型（LMMs）挑战方面的潜力。这些挑战包括无法适应不同点云分辨率，以及编码器生成的点特征无法满足大语言模型（LLMs）的语义需求。我们提出了三维LMMs去除编码器并让LLM承担三维编码器角色的关键方法：1）在预训练阶段提出LLM嵌入式语义编码策略，探究多种点云自监督损失函数的影响，并提出混合语义损失以提取高层次语义特征；2）在指令微调阶段引入层次化几何聚合策略，通过向LLM层注入归纳偏置使其关注点云局部细节。最终我们提出首个编码器无关的三维LMM模型ENEL。我们的70亿参数模型性能媲美当前最先进的130亿参数模型ShapeLLM-13B，在分类、描述生成和视觉问答任务上分别达到55.10%、50.98%和43.10%的准确率。实验结果表明，编码器无关架构在三维理解领域具有替代基于编码器架构的高度可行性。代码已发布于https://github.com/Ivan-Tang-3D/ENEL

FANformer: Improving Large Language Models Through Effective Periodicity Modeling

Abstract

arXiv:2502.21309v2 Announce Type: replace-cross Abstract: Periodicity, as one of the most important basic characteristics, lays the foundation for facilitating structured knowledge acquisition and systematic cognitive processes within human learning paradigms. However, the potential flaws of periodicity modeling in the Transformer affect the learning efficiency and the establishment of underlying principles from data for large language models (LLMs) built upon it. In this paper, we demonstrate that integrating effective periodicity modeling can improve the learning efficiency and performance of LLMs. We introduce FANformer, which adapts the Fourier Analysis Network (FAN) into the attention mechanism to achieve efficient periodicity modeling by modifying the feature projection process of the attention mechanism. Extensive experimental results on language modeling show that FANformer consistently outperforms Transformer when scaling up model size and training tokens, underscoring its superior learning efficiency. Our pretrained FANformer-1B exhibits marked improvements on downstream tasks compared to open-source LLMs with similar model parameters or training tokens. Moreover, we reveal that FANformer exhibits superior ability to learn and apply rules for reasoning compared to Transformer. The results position FANformer as an effective and promising architecture for advancing LLMs.

摘要

周期性作为最重要的基本特征之一，为人类学习范式中结构化知识获取和系统性认知过程的建立奠定了基础。然而，Transformer中周期性建模的潜在缺陷影响了基于该架构的大型语言模型（LLMs）从数据中学习效率和底层规律的建立。本文通过实验证明，整合有效的周期性建模能够提升LLMs的学习效率和性能。我们提出FANformer模型，通过改进注意力机制的特征投影过程，将傅里叶分析网络（FAN）融入注意力机制以实现高效周期性建模。语言建模任务的广泛实验结果表明，在扩大模型规模和训练标记量时，FANformer始终优于Transformer，彰显其卓越的学习效率。与模型参数或训练标记量相近的开源LLMs相比，我们预训练的FANformer-1B在下游任务中展现出显著提升。此外，研究发现FANformer在规则学习和推理应用方面表现出优于Transformer的能力。这些结果确立了FANformer作为推动LLMs发展的有效且具有前景的架构地位。

Language-Enhanced Representation Learning for Single-Cell Transcriptomics

Abstract

arXiv:2503.09427v2 Announce Type: replace-cross Abstract: Single-cell RNA sequencing (scRNA-seq) offers detailed insights into cellular heterogeneity. Recent advancements leverage single-cell large language models (scLLMs) for effective representation learning. These models focus exclusively on transcriptomic data, neglecting complementary biological knowledge from textual descriptions. To overcome this limitation, we propose scMMGPT, a novel multimodal framework designed for language-enhanced representation learning in single-cell transcriptomics. Unlike existing methods, scMMGPT employs robust cell representation extraction, preserving quantitative gene expression data, and introduces an innovative two-stage pre-training strategy combining discriminative precision with generative flexibility. Extensive experiments demonstrate that scMMGPT significantly outperforms unimodal and multimodal baselines across key downstream tasks, including cell annotation and clustering, and exhibits superior generalization in out-of-distribution scenarios.

摘要

单细胞RNA测序（scRNA-seq）为解析细胞异质性提供了精细视角。近期研究通过单细胞大语言模型（scLLMs）实现了高效表征学习，但这些模型仅聚焦转录组数据，忽略了文本描述中的互补生物学知识。为突破这一局限，我们提出scMMGPT——一个专为单细胞转录组学中语言增强表征学习设计的新型多模态框架。与现有方法不同，scMMGPT采用保留定量基因表达数据的强健细胞表征提取技术，并创新性地引入融合判别精度与生成灵活性的两阶段预训练策略。大量实验表明，scMMGPT在细胞注释和聚类等关键下游任务中显著优于单模态及多模态基线方法，同时在分布外场景中展现出卓越的泛化能力。

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Abstract

arXiv:2503.09573v3 Announce Type: replace-cross Abstract: Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms

摘要

扩散语言模型因其并行生成和可控性的潜力，相比自回归模型具有独特优势，但在似然建模方面表现欠佳且仅支持固定长度生成。本研究提出了一类块扩散语言模型，在离散去噪扩散与自回归模型之间实现了插值。块扩散技术通过支持可变长度生成、结合KV缓存与并行令牌采样提升推理效率，克服了两种方法的关键局限。我们提出了一套构建高效块扩散模型的方案，包括：高效训练算法、梯度方差估计器，以及通过数据驱动噪声调度实现方差最小化。块扩散模型在语言建模基准测试中创造了扩散模型的新性能纪录，并支持任意长度序列生成。项目页面（https://m-arriola.com/bd3lms）提供了代码、模型权重及技术博客。

HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models

Abstract

arXiv:2503.12908v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) often generate hallucinations, producing outputs that are contextually inaccurate or factually incorrect. We introduce HICD, a novel method designed to induce hallucinations for contrastive decoding to mitigate hallucinations. Unlike existing contrastive decoding methods, HICD selects attention heads crucial to the model's prediction as inducing heads, then induces hallucinations by dispersing attention of these inducing heads and compares the hallucinated outputs with the original outputs to obtain the final result. Our approach significantly improves performance on tasks requiring contextual faithfulness, such as context completion, reading comprehension, and question answering. It also improves factuality in tasks requiring accurate knowledge recall. We demonstrate that our inducing heads selection and attention dispersion method leads to more "contrast-effective" hallucinations for contrastive decoding, outperforming other hallucination-inducing methods. Our findings provide a promising strategy for reducing hallucinations by inducing hallucinations in a controlled manner, enhancing the performance of LLMs in a wide range of tasks.

摘要

大型语言模型（LLMs）常产生幻觉现象，生成上下文不准确或事实错误的输出。本文提出HICD方法，通过诱导幻觉进行对比解码以缓解该问题。与现有对比解码方法不同，HICD首先筛选对模型预测至关重要的注意力头作为诱导头，随后通过分散这些诱导头的注意力来诱发幻觉，并将幻觉输出与原始输出对比获得最终结果。该方法在需要上下文忠实度的任务（如语境补全、阅读理解和问答）中表现显著提升，同时在需要精确知识检索的任务中也提高了事实准确性。实验表明，我们的诱导头选择与注意力分散方法能产生更"有效对比"的幻觉，其效果优于其他幻觉诱导方法。本研究提供了一种通过受控方式诱导幻觉来降低模型幻觉的新策略，可显著提升LLMs在多种任务中的表现。
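
To make the contrast step concrete, here is a minimal sketch of one widely used contrastive-decoding scoring form, in which next-token logits from the original pass are amplified and logits from a hallucination-prone pass are subtracted. The abstract does not give HICD's exact scoring rule, so the formula, the weight `alpha`, and the toy logits are assumptions for illustration; how the inducing heads are selected and dispersed is not modeled here.

```python
def contrastive_scores(orig_logits, induced_logits, alpha=1.0):
    """Score each token as (1 + alpha) * original - alpha * induced.

    Tokens that the hallucination-induced pass favors are pushed down.
    """
    return [(1.0 + alpha) * o - alpha * h
            for o, h in zip(orig_logits, induced_logits)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Toy logits over a 3-token vocabulary (illustrative only).
orig = [2.0, 2.1, 0.1]      # the original pass slightly prefers token 1
induced = [1.0, 2.5, 0.0]   # the induced pass strongly prefers token 1
chosen = argmax(contrastive_scores(orig, induced, alpha=1.0))
```

In this toy case the greedy choice moves from token 1 to token 0, because token 1 is the one the induced pass favors most.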

MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings

Abstract

arXiv:2503.03008v2 Announce Type: replace-cross Abstract: Deploying language models often requires navigating accuracy vs. performance trade-offs to meet latency constraints while preserving utility. Traditional model distillation reduces size but incurs substantial costs through training separate models. We introduce ModularStarEncoder (MoSE), a 1-billion-parameter multi-exit encoder for code retrieval and classification that employs a novel Self-Distillation mechanism. This approach significantly enhances lower-layer representations, enabling flexible deployment of different model portions with favorable performance trade-offs. Our architecture improves text-to-code and code-to-code search by targeting specific encoder layers as exit heads, where higher layers guide earlier ones during training, improving intermediate representations at minimal additional cost. We further enhance MoSE with a repository-level contextual loss that maximizes training context window utilization. Additionally, we release a new dataset created through code translation that extends text-to-code benchmarks with cross-language code-to-code pairs. Evaluations demonstrate the effectiveness of Self-Distillation as a principled approach to trading inference cost for accuracy across various code understanding tasks.

摘要

部署语言模型通常需要在准确性与性能之间进行权衡，以满足延迟约束同时保持实用性。传统模型蒸馏方法通过训练独立模型来缩小规模，但会产生显著成本。我们提出模块化星型编码器（MoSE），这是一个具有10亿参数的多出口编码器，用于代码检索与分类任务，采用新型自蒸馏机制。该方法显著增强了底层表征能力，使得模型不同部分能够灵活部署并获得优越的性能权衡。我们的架构通过将特定编码层定位为出口头（其中高层在训练过程中指导低层），以最小额外成本改进中间表征，从而提升文本到代码及代码到代码的搜索性能。我们进一步采用仓库级上下文损失函数来最大化训练上下文窗口的利用率，从而增强MoSE性能。此外，我们发布了一个通过代码翻译构建的新数据集，该数据集通过跨语言代码对扩展了文本到代码基准测试。评估结果表明，自蒸馏作为一种原则性方法，在各种代码理解任务中能有效实现推理成本与准确性的权衡。

UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality

Abstract

arXiv:2503.10669v2 Announce Type: replace-cross Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs) with human values. However, existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. Methods such as RiC that directly inject raw reward values into prompts face significant numerical sensitivity issues--for instance, LLMs may fail to distinguish between 9.11 and 9.8--while alternatives like MORLHF, Rewarded Soups, and MODPO incur high computational costs by training multiple models. In this work, we introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations. Our approach leverages a diverse set of strictly increasing, non-linear utility functions to transform user-specified preferences into symbolic tokens, which are then used to condition a single LLM. This design not only mitigates numerical reasoning challenges but also substantially reduces training overhead, yielding models that achieve superior Pareto fronts and robust alignment across complex reward dimensions.

摘要

基于人类反馈的强化学习（RLHF）已成为将大语言模型（LLMs）与人类价值观对齐的关键技术。然而，现有方法难以捕捉人类偏好的多维度分布特性。诸如RiC等直接将原始奖励值注入提示的方法面临显著的数值敏感性挑战——例如，LLMs可能无法区分9.11和9.8——而MORLHF、Rewarded Soups和MODPO等方法则因需训练多个模型导致高昂计算成本。本研究提出效用条件化多目标对齐框架（UC-MOA），通过创新设计突破这些局限。该框架利用一组严格递增的非线性效用函数，将用户指定偏好转化为符号化标记，进而用于调节单一LLM。此方案不仅缓解了数值推理难题，还大幅降低训练开销，最终生成的模型在复杂奖励维度上实现了更优的帕累托前沿与鲁棒对齐。

Effectively Controlling Reasoning Models through Thinking Intervention

Abstract

arXiv:2503.24370v2 Announce Type: replace-cross Abstract: Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. In this paper, we demonstrate that this emerging generation framework offers a unique opportunity for more fine-grained control over model behavior. We propose Thinking Intervention, a novel paradigm designed to explicitly guide the internal reasoning processes of LLMs by strategically inserting or revising specific thinking tokens. We find that the Thinking Intervention paradigm enhances the capabilities of reasoning models across a wide range of tasks, including instruction following on IFEval, instruction hierarchy on SEP, and safety alignment on XSTest and SorryBench. Our results demonstrate that Thinking Intervention significantly outperforms baseline prompting approaches, achieving up to 6.7% accuracy gains in instruction-following scenarios, 15.4% improvements in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs.

摘要

推理增强型大语言模型（LLMs）在生成最终答案前会显式生成中间推理步骤，这种机制显著提升了模型处理复杂问题的能力。本文论证了这一新兴生成框架为实现更细粒度的模型行为控制提供了独特机遇。我们提出"思维干预"新范式，通过策略性插入或修改特定思维标记，实现对LLMs内部推理过程的显式引导。研究发现，该范式能全面提升推理模型的多项能力：在IFEval上的指令遵循、SEP的指令层级理解、XSTest和SorryBench的安全对齐等方面均取得显著改进。实验结果表明，思维干预显著优于基线提示方法，在使用开源DeepSeek R1模型时，指令遵循场景准确率最高提升6.7%，指令层级推理能力提高15.4%，对不安全提示的拒绝率增加40.0%。本研究为控制推理型大语言模型开辟了新的研究方向。

Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with Large Language Models

Abstract

arXiv:2503.14411v2 Announce Type: replace-cross Abstract: Temporal graph neural networks (TGNNs) have shown remarkable performance in temporal graph modeling. However, real-world temporal graphs often possess rich textual information, giving rise to temporal text-attributed graphs (TTAGs). Such a combination of dynamic text semantics and evolving graph structures introduces heightened complexity. Existing TGNNs embed texts statically and rely heavily on encoding mechanisms that biasedly prioritize structural information, overlooking the temporal evolution of text semantics and the essential interplay between semantics and structures for synergistic reinforcement. To tackle these issues, we present CROSS, a flexible framework that seamlessly extends existing TGNNs for TTAG modeling. CROSS is designed by decomposing the TTAG modeling process into two phases: (i) temporal semantics extraction; and (ii) semantic-structural information unification. The key idea is to advance the large language models (LLMs) to dynamically extract the temporal semantics in text space and then generate cohesive representations unifying both semantics and structures. Specifically, we propose a Temporal Semantics Extractor in the CROSS framework, which empowers LLMs to offer the temporal semantic understanding of node's evolving contexts of textual neighborhoods, facilitating semantic dynamics. Subsequently, we introduce the Semantic-structural Co-encoder, which collaborates with the above Extractor for synthesizing illuminating representations by jointly considering both semantic and structural information while encouraging their mutual reinforcement. Extensive experiments show that CROSS achieves state-of-the-art results on four public datasets and one industrial dataset, with 24.7% absolute MRR gain on average in temporal link prediction and 3.7% AUC gain in node classification of industrial application.

摘要

时间图神经网络（TGNNs）在时序图建模中展现出卓越性能。然而现实世界中的时序图常包含丰富文本信息，由此催生了时序文本属性图（TTAGs）。动态文本语义与演化图结构的结合带来了更高复杂性。现有TGNNs采用静态文本嵌入方式，且过度依赖偏向结构信息的编码机制，忽视了文本语义的时序演化及其与结构间协同强化的本质关联。为解决这些问题，我们提出CROSS框架，可灵活扩展现有TGNNs以支持TTAG建模。该框架将建模过程解耦为两个阶段：（i）时序语义提取；（ii）语义-结构信息融合。其核心思想是推动大语言模型（LLMs）动态提取文本空间的时序语义，进而生成统一语义与结构的凝聚表征。具体而言，我们设计了时序语义提取器，使LLMs能够理解节点文本邻域的时序语境变化，捕捉语义动态性；随后提出语义-结构协同编码器，与提取器协作生成融合双重信息的表征，促进语义与结构的相互增强。大量实验表明，CROSS在四个公共数据集和一个工业数据集上取得最先进成果：时序链接预测任务平均绝对MRR提升24.7%，工业应用场景的节点分类任务AUC提升3.7%。

ImF: Implicit Fingerprint for Large Language Models

Abstract

arXiv:2503.21805v2 Announce Type: replace-cross Abstract: Training large language models (LLMs) is resource-intensive and expensive, making protecting intellectual property (IP) for LLMs crucial. Recently, embedding fingerprints into LLMs has emerged as a prevalent method for establishing model ownership. However, existing fingerprinting techniques typically embed identifiable patterns with weak semantic coherence, resulting in fingerprints that significantly differ from the natural question-answering (QA) behavior inherent to LLMs. This discrepancy undermines the stealthiness of the embedded fingerprints and makes them vulnerable to adversarial attacks. In this paper, we first demonstrate the critical vulnerability of existing fingerprint embedding methods by introducing a novel adversarial attack named Generation Revision Intervention (GRI) attack. GRI attack exploits the semantic fragility of current fingerprinting methods, effectively erasing fingerprints by disrupting their weakly correlated semantic structures. Our empirical evaluation highlights that traditional fingerprinting approaches are significantly compromised by the GRI attack, revealing severe limitations in their robustness under realistic adversarial conditions. To advance the state-of-the-art in model fingerprinting, we propose a novel model fingerprint paradigm called Implicit Fingerprints (ImF). ImF leverages steganography techniques to subtly embed ownership information within natural texts, subsequently using Chain-of-Thought (CoT) prompting to construct semantically coherent and contextually natural QA pairs. This design ensures that fingerprints seamlessly integrate with the standard model behavior, remaining indistinguishable from regular outputs and substantially reducing the risk of accidental triggering and targeted removal. We conduct a comprehensive evaluation of ImF on 15 diverse LLMs, spanning different architectures and varying scales.

摘要

训练大型语言模型（LLMs）需要大量资源且成本高昂，因此保护LLMs的知识产权（IP）至关重要。近期，将指纹嵌入LLMs已成为确立模型所有权的主流方法。然而，现有指纹技术通常嵌入语义连贯性较弱的可识别模式，导致指纹与LLMs固有的自然问答（QA）行为存在显著差异。这种差异削弱了嵌入指纹的隐蔽性，使其易受对抗攻击。本文首先通过提出一种名为"生成修正干预"（GRI）的新型对抗攻击，揭示了现有指纹嵌入方法的关键脆弱性。GRI攻击利用当前指纹方法的语义脆弱性，通过破坏其弱相关的语义结构有效擦除指纹。实证评估表明，传统指纹方法在GRI攻击下严重受损，暴露出其在真实对抗条件下鲁棒性的重大局限。为推进模型指纹技术发展，我们提出名为"隐式指纹"（ImF）的新型模型指纹范式。ImF利用隐写术技术将所有权信息微妙地嵌入自然文本，继而通过思维链（CoT）提示构建语义连贯且上下文自然的QA对。该设计确保指纹与标准模型行为无缝融合，与常规输出无法区分，显著降低意外触发和针对性移除的风险。我们在15种不同架构和规模的LLMs上对ImF进行了全面评估。

ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation

Abstract

arXiv:2503.21729v3 Announce Type: replace-cross Abstract: Large Reasoning Models (LRMs) exhibit remarkable reasoning abilities but rely primarily on parametric knowledge, limiting factual accuracy. While recent works equip reinforcement learning (RL)-based LRMs with retrieval capabilities, they suffer from overthinking and lack robustness in reasoning, reducing their effectiveness in question answering (QA) tasks. To address this, we propose ReaRAG, a factuality-enhanced reasoning model that explores diverse queries without excessive iterations. Our solution includes a novel data construction framework with an upper bound on the reasoning chain length. Specifically, we first leverage an LRM to generate deliberate thinking, then select an action from a predefined action space (Search and Finish). For Search action, a query is executed against the RAG engine, where the result is returned as observation to guide reasoning steps later. This process iterates until a Finish action is chosen. Benefiting from ReaRAG's strong reasoning capabilities, our approach outperforms existing baselines on multi-hop QA. Further analysis highlights its strong reflective ability to recognize errors and refine its reasoning trajectory. Our study enhances LRMs' factuality while effectively integrating robust reasoning for Retrieval-Augmented Generation (RAG).

摘要

大型推理模型（LRMs）展现出卓越的推理能力，但其主要依赖参数化知识，导致事实准确性受限。尽管近期研究为基于强化学习（RL）的LRMs配备了检索功能，这些模型仍存在过度思考与推理鲁棒性不足的问题，降低了其在问答（QA）任务中的有效性。为此，我们提出ReaRAG——一种事实性增强的推理模型，该模型能在避免过度迭代的前提下探索多样化查询。我们的解决方案包含一个具有推理链长度上限的新型数据构建框架。具体而言，首先利用LRM生成审慎思考，随后从预定义动作空间（搜索与完成）中选择动作。若选择搜索动作，则向RAG引擎执行查询，返回结果作为后续推理步骤的观察依据。此过程迭代直至选择完成动作为止。得益于ReaRAG强大的推理能力，我们的方法在多跳QA任务中优于现有基线。进一步分析表明，该模型具备识别错误并优化推理轨迹的强反思能力。本研究在增强LRMs事实性的同时，有效整合了检索增强生成（RAG）所需的鲁棒推理能力。
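
The thought, action, observation loop described in the ReaRAG abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `reason_with_retrieval`, `toy_policy`, `toy_search`, and the `max_steps` cap are hypothetical names chosen for this example, and the reasoning model and RAG engine are replaced by toy stand-ins. The abstract places the length bound in data construction; here a step cap at inference plays that role for illustration only.

```python
from typing import Callable, List, Tuple

# Hypothetical convention: a policy returns (thought, action, argument).
Step = Tuple[str, str, str]


def reason_with_retrieval(
    question: str,
    policy: Callable[[str, List[str]], Step],
    search: Callable[[str], str],
    max_steps: int = 4,
) -> str:
    """Iterate thought -> action -> observation until Finish or the step cap."""
    history: List[str] = []
    for _ in range(max_steps):
        thought, action, argument = policy(question, history)
        history.append(f"Thought: {thought}")
        if action == "finish":
            return argument  # for Finish, the argument carries the final answer
        # Search action: query the retriever and record the observation.
        observation = search(argument)
        history.append(f"Action: search({argument})")
        history.append(f"Observation: {observation}")
    return "no answer within step budget"


# Toy stand-ins for the reasoning model and the RAG engine (assumptions).
def toy_policy(question: str, history: List[str]) -> Step:
    if not any(h.startswith("Observation:") for h in history):
        return ("I need the capital first.", "search", "capital of France")
    return ("The observation answers the question.", "finish", "Paris")


def toy_search(query: str) -> str:
    return "Paris is the capital of France."


answer = reason_with_retrieval("What is the capital of France?", toy_policy, toy_search)
print(answer)
```

Capping the loop keeps a policy that never chooses Finish from iterating indefinitely, which is one simple way to limit overthinking.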

Detecting LLM-Generated Peer Reviews

Abstract

arXiv:2503.15772v2 Announce Type: replace-cross Abstract: The integrity of peer review is fundamental to scientific progress, but the rise of large language models (LLMs) has introduced concerns that some reviewers may rely on these tools to generate reviews rather than writing them independently. Although some venues have banned LLM-assisted reviewing, enforcement remains difficult as existing detection tools cannot reliably distinguish between fully generated reviews and those merely polished with AI assistance. In this work, we address the challenge of detecting LLM-generated reviews. We consider the approach of performing indirect prompt injection via the paper's PDF, prompting the LLM to embed a covert watermark in the generated review, and subsequently testing for presence of the watermark in the review. We identify and address several pitfalls in naive implementations of this approach. Our primary contribution is a rigorous watermarking and detection framework that offers strong statistical guarantees. Specifically, we introduce watermarking schemes and hypothesis tests that control the family-wise error rate across multiple reviews, achieving higher statistical power than standard corrections such as Bonferroni, while making no assumptions about the nature of human-written reviews. We explore multiple indirect prompt injection strategies--including font-based embedding and obfuscated prompts--and evaluate their effectiveness under various reviewer defense scenarios. Our experiments find high success rates in watermark embedding across various LLMs. We also empirically find that our approach is resilient to common reviewer defenses, and that the bounds on error rates in our statistical tests hold in practice. In contrast, we find that Bonferroni-style corrections are too conservative to be useful in this setting.

摘要

同行评审的诚信是科学进步的基石，然而大型语言模型（LLM）的兴起引发了新的担忧——部分评审者可能依赖此类工具生成评审意见而非独立撰写。尽管部分学术会议已禁止LLM辅助评审，但由于现有检测工具无法可靠区分完全由AI生成的评审与仅经AI润色的评审，实际监管仍面临困难。本研究致力于解决LLM生成评审的检测难题：我们提出通过论文PDF实施间接提示注入，诱导LLM在生成评审中嵌入隐蔽水印，继而检测该水印是否存在。针对该方案原始实现中的若干缺陷，我们进行了系统性改进。核心贡献在于建立了一套具有严格统计保证的水印检测框架：我们提出的水印方案与假设检验方法能有效控制多重评审中的家族错误率，其统计功效优于Bonferroni等传统校正方法，且无需对人类撰写评审的特性做任何假设。我们探索了多种间接提示注入策略（包括基于字体的嵌入和混淆提示），并在不同评审者防御场景下评估其有效性。实验表明，该方法在各类LLM中均能实现高成功率的水印嵌入。实证研究还发现：（1）本方法对常见评审者防御具有强鲁棒性；（2）统计检验中的错误率边界在实践中成立。相比之下，Bonferroni式校正因过于保守而在此场景中失去实用性。
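
To make the remark about conservative Bonferroni-style corrections concrete, the sketch below compares the Bonferroni rule with Holm's step-down procedure, a standard method that also controls the family-wise error rate. Holm is used here only as a familiar point of comparison; it is not the test proposed in the paper, and the p-values are made up for illustration.

```python
from typing import List


def bonferroni_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject hypothesis i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down: compare the k-th smallest p-value (k from 0) with alpha / (m - k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Illustrative p-values, one per review (invented for this example).
p = [0.001, 0.015, 0.02, 0.2]
bonf = bonferroni_reject(p)
holm = holm_reject(p)
print(bonf, holm)
```

With four tests, Bonferroni compares every p-value against 0.0125, so its threshold shrinks as the number of reviews grows. That loss of power is what the abstract objects to.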

Large Language Models Could Be Rote Learners

Abstract

arXiv:2504.08300v4 Announce Type: replace-cross Abstract: Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework reformulating MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval's effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points (in MMLU on average).

摘要

多项选择题（MCQ）基准被广泛用于评估大语言模型（LLMs），但其可靠性受到基准污染的削弱。本研究将污染重新定义为学习的内在组成部分，旨在区分LLM评估中真正的能力获取与表面记忆。首先，通过分析模型在不同记忆条件下的表现，我们发现了一个反直觉趋势：LLMs在已记忆的MCQ上表现反而比未记忆的更差，这表明存在两种不同的学习现象共存，即机械记忆与真实能力学习。为区分二者，我们提出TrinEval——一种将MCQ重构为三元组形式的新评估框架，在保留知识评估的同时减少记忆依赖。实验验证了TrinEval在重构中的有效性，其评估显示常见LLM可能通过机械记忆方式掌握20.5%的知识点（以MMLU平均值为准）。

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

Abstract

arXiv:2504.07891v2 Announce Type: replace-cross Abstract: Recent advances in inference-time compute have significantly improved performance on complex tasks by generating long chains of thought (CoTs) using Large Reasoning Models (LRMs). However, this improved accuracy comes at the cost of high inference latency due to the length of generated reasoning sequences and the autoregressive nature of decoding. Our key insight in tackling these overheads is that LRM inference, and the reasoning that it embeds, is highly tolerant of approximations: complex tasks are typically broken down into simpler steps, each of which brings utility based on the semantic insight it provides for downstream steps rather than the exact tokens it generates. Accordingly, we introduce SpecReason, a system that automatically accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps and reserving the costly base model only to assess (and potentially correct) the speculated outputs. Importantly, SpecReason's focus on exploiting the semantic flexibility of thinking tokens in preserving final-answer accuracy is complementary to prior speculation techniques, most notably speculative decoding, which demands token-level equivalence at each step. Across a variety of reasoning benchmarks, SpecReason achieves $1.4-3.0\times$ speedup over vanilla LRM inference while improving accuracy by $0.4-9.0\%$ . Compared to speculative decoding without SpecReason, their combination yields an additional $8.8-58.0\%$ latency reduction. We open-source SpecReason at https://github.com/ruipeterpan/specreason.

摘要

推理时计算的最新进展通过利用大型推理模型（LRMs）生成长链思维（CoTs），显著提升了复杂任务的性能。然而，这种准确性的提升伴随着高推理延迟的代价，这源于生成推理序列的长度和自回归解码的特性。我们在解决这些开销时的关键发现是：LRM推理及其嵌入的推理过程对近似具有高度容忍性——复杂任务通常被分解为更简单的步骤，每个步骤的效用取决于其为下游步骤提供的语义洞察，而非其生成的确切标记。为此，我们提出了SpecReason系统，该系统通过使用轻量级模型（推测性地）执行较简单的中间推理步骤，并仅保留昂贵的基础模型用于评估（及可能纠正）推测输出，从而自动加速LRM推理。值得注意的是，SpecReason通过利用思维标记的语义灵活性来保持最终答案准确性，这与先前的推测技术（尤其是要求每一步标记级等效的推测解码）形成互补。在多种推理基准测试中，SpecReason实现了比原始LRM推理1.4-3.0倍的加速，同时将准确性提高了0.4-9.0%。与未结合SpecReason的推测解码相比，二者的组合可额外降低8.8-58.0%的延迟。我们在https://github.com/ruipeterpan/specreason开源了SpecReason。
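
The division of labor described in the SpecReason abstract can be sketched as a loop in which a lightweight model drafts each step and the base model only scores it, regenerating the step when the score is too low. This is a hedged sketch: the function names, the acceptance `threshold`, and the toy models are assumptions made for this example and do not reflect the API of the released repository.

```python
from typing import Callable, List


def speculative_reasoning(
    draft_step: Callable[[List[str]], str],
    base_step: Callable[[List[str]], str],
    base_score: Callable[[List[str], str], float],
    is_done: Callable[[List[str]], bool],
    threshold: float = 0.7,
    max_steps: int = 8,
) -> List[str]:
    """Accept drafted steps the base model rates highly; otherwise fall back."""
    steps: List[str] = []
    while not is_done(steps) and len(steps) < max_steps:
        candidate = draft_step(steps)  # cheap speculative step
        if base_score(steps, candidate) >= threshold:
            steps.append(candidate)  # accepted on semantic quality, not token match
        else:
            steps.append(base_step(steps))  # costly model rewrites the step
    return steps


# Toy stand-ins (assumptions): the second drafted step is judged poor.
trace = speculative_reasoning(
    draft_step=lambda s: f"draft-{len(s)}",
    base_step=lambda s: f"base-{len(s)}",
    base_score=lambda s, c: 0.2 if len(s) == 1 else 0.9,
    is_done=lambda s: len(s) >= 3,
)
print(trace)
```

Unlike speculative decoding, acceptance here depends on a score for the whole step rather than token-level equivalence, which matches the complementarity the abstract points out.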

Mimic In-Context Learning for Multimodal Tasks

Abstract

arXiv:2504.08851v2 Announce Type: replace-cross Abstract: Recently, In-context Learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configurations of ICDs, stimulating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as "shift vectors" added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with some previous shift vector-based methods, MimIC more strictly approximates the shift effects by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, Captioning) demonstrate that MimIC outperforms existing shift vector-based methods. The code is available at https://github.com/Kamichanw/MimIC.

摘要

近年来，上下文学习（ICL）已成为大型多模态模型（LMMs）中的重要推理范式，通过少量上下文示例（ICDs）来引导LMMs完成新任务。然而，多模态数据的协同效应增加了ICL性能对ICD配置的敏感性，从而需要一种更稳定且通用的映射函数。从数学角度看，在基于Transformer的模型中，ICDs作为"偏移向量"被添加到查询标记的隐藏状态中。受此启发，我们提出了模仿上下文学习（MimIC），旨在从ICDs中学习稳定且可泛化的偏移效应。具体而言，与以往基于偏移向量的方法相比，MimIC通过将轻量级可学习模块集成到LMMs中，并引入四项关键改进，更严格地逼近偏移效应：1）在注意力层后插入偏移向量，2）为每个注意力头分配一个偏移向量，3）使偏移幅度与查询相关，4）采用分层对齐损失。在两个LMMs（Idefics-9b和Idefics2-8b-base）上针对三项多模态任务（VQAv2、OK-VQA、Captioning）的广泛实验表明，MimIC优于现有的基于偏移向量的方法。代码发布于https://github.com/Kamichanw/MimIC。
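
The "shift vector" view mentioned in the MimIC abstract follows from a standard softmax identity. As a sketch under assumed notation (not taken from the paper): let $q$ be a query vector, $K, V$ the keys and values of the query tokens, $K_D, V_D$ those of the in-context demonstrations, and $d$ the head dimension. Attention over the concatenation then splits into ordinary self-attention plus a query-dependent shift:

```latex
\begin{aligned}
Z(q, K) &= \sum_{i} \exp\!\left(q^{\top} k_i / \sqrt{d}\right), \qquad
\mu = \frac{Z(q, K_D)}{Z(q, K_D) + Z(q, K)}, \\
\operatorname{Attn}\!\left(q, [K_D; K], [V_D; V]\right)
  &= (1 - \mu)\,\operatorname{Attn}(q, K, V) + \mu\,\operatorname{Attn}(q, K_D, V_D) \\
  &= \operatorname{Attn}(q, K, V) + \mu\,\bigl(\operatorname{Attn}(q, K_D, V_D) - \operatorname{Attn}(q, K, V)\bigr).
\end{aligned}
```

The second term is the shift. Its magnitude $\mu$ depends on $q$, which is consistent with the abstract's choice to make the learned shift magnitude query-dependent.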

CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models

Abstract

arXiv:2504.13534v2 Announce Type: replace-cross Abstract: Chain-of-thought (CoT) reasoning boosts large language models' (LLMs) performance on complex tasks but faces two key limitations: a lack of reliability when solely relying on LLM-generated reasoning chains and interference from natural language reasoning steps with the models' inference process, also known as the inference logic of LLMs. To address these issues, we propose CoT-RAG, a novel reasoning framework with three key designs: (i) Knowledge Graph-driven CoT Generation, featuring knowledge graphs to modulate reasoning chain generation of LLMs, thereby enhancing reasoning credibility; (ii) Learnable Knowledge Case-aware RAG, which incorporates retrieval-augmented generation (RAG) into knowledge graphs to retrieve relevant sub-cases and sub-descriptions, providing LLMs with learnable information; (iii) Pseudo-Program Prompting Execution, which promotes greater logical rigor by guiding LLMs to execute reasoning tasks as pseudo-programs. Evaluations on nine public datasets spanning three reasoning tasks reveal significant accuracy gains--ranging from 4.0% to 44.3%--over state-of-the-art methods. Furthermore, tests on four domain-specific datasets demonstrate exceptional accuracy and efficient execution, underscoring its practical applicability and scalability.

摘要

链式思维（CoT）推理虽然能提升大语言模型（LLMs）在复杂任务中的表现，但存在两个关键缺陷：仅依赖LLM生成的推理链时可靠性不足，以及自然语言推理步骤会干扰模型的推断过程（即LLMs的推理逻辑）。为解决这些问题，我们提出CoT-RAG——一种新型推理框架，包含三项核心设计：（1）知识图谱驱动的CoT生成：通过知识图谱调控LLMs的推理链生成，从而增强推理可信度；（2）可学习知识情境感知的RAG：将检索增强生成（RAG）融入知识图谱以检索相关子案例和子描述，为LLMs提供可学习信息；（3）伪程序提示执行：通过引导LLMs以伪程序形式执行推理任务，提升逻辑严谨性。在涵盖三类推理任务的九个公共数据集上的评估表明，该方法相较最先进技术实现了4.0%至44.3%的显著准确率提升。此外，在四个领域专用数据集上的测试展现出优异的准确性和高效执行效率，印证了其实际适用性和可扩展性。

Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

Abstract

arXiv:2504.13123v2 Announce Type: replace-cross Abstract: In recent years, the field of vision-language model pre-training has experienced rapid advancements, driven primarily by the continuous enhancement of textual capabilities in large language models. However, existing training paradigms for multimodal large language models heavily rely on high-quality image-text pairs. As models and data scales grow exponentially, the availability of such meticulously curated data has become increasingly scarce and saturated, thereby severely limiting further advancements in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates that large-scale low-hallucination synthetic captions can serve dual purposes: 1) acting as a viable alternative to real-world data for pre-training paradigms and 2) achieving superior performance enhancement when integrated into vision-language models through empirical validation. This paper presents the following key contributions: 1) a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology yields remarkable results in reducing hallucinations. Specifically, the non-hallucination caption rate on a held-out test set increases from 48.3% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation reveals that our synthetic captions confer superior pre-training advantages over their counterparts. Across 15 vision language tasks, the model trained with our data achieves a significant performance gain of at least 6.2% compared to identical images with alt-text. In 20 common cognitive domains, the model trained with our data outperforms the alt-text data by at least 7.5%. Meanwhile, it also offers considerable support in the text-to-image domain. With our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and 13.3 on the MSCOCO validation benchmark.

摘要

近年来，视觉语言模型预训练领域取得了快速进展，这主要得益于大语言模型文本能力的持续提升。然而，现有多模态大语言模型的训练范式高度依赖高质量图文配对数据。随着模型和数据规模呈指数级增长，此类精细标注数据的可获得性日益稀缺并趋于饱和，严重制约了该领域的进一步发展。本研究探索了面向视觉语言模型预训练的可扩展描述生成技术，证实大规模低幻觉合成描述具有双重作用：1）可作为预训练范式中真实数据的有效替代品；2）经实证验证，将其整合至视觉语言模型后可实现更优异的性能提升。本文主要贡献包括：1）提出了一种生成高质量、低幻觉且富含知识的合成描述的新流程。我们提出的持续DPO方法在降低幻觉方面成效显著，7B规模模型在保留测试集上的非幻觉描述率从48.3%提升至77.9%。2）全面实证研究表明，我们的合成描述能带来更优越的预训练优势。在15项视觉语言任务中，使用本数据训练的模型相较采用替代文本的相同图像至少获得6.2%的性能提升；在20个常见认知领域，我们的训练数据使模型性能较替代文本数据至少高出7.5%。同时，该数据在文生图领域也展现出显著支持效果，使用本数据集使真实世界验证基准的FID分数降低17.1，MSCOCO验证基准降低13.3。

Abstract

arXiv:2504.13945v4 Announce Type: replace-cross Abstract: The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly in optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, like the widely used OCRBench, mainly focus on verifying the correctness of their short-text responses and long-text responses with simple layout, while the evaluation of their ability to understand long texts with complex layout design is highly significant but largely overlooked. In this paper, we propose Menu OCR and Translation Benchmark (MOTBench), a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish, along with its price and unit items on a menu, providing a comprehensive assessment of their visual understanding and language processing capabilities. Our benchmark comprises a collection of Chinese and English menus, characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, along with precise human annotations. Experiments show that our automatic evaluation results are highly consistent with professional human evaluation. We evaluate a range of publicly available state-of-the-art LVLMs and analyze their outputs to identify strengths and weaknesses in their performance, offering valuable insights to guide future advancements in LVLM development. MOTBench is available at https://github.com/gitwzl/MOTBench.

摘要

大型视觉语言模型（LVLMs）的快速发展显著推动了文档理解领域的应用，特别是在光学字符识别（OCR）与多语言翻译方面。然而，当前对LVLMs的评估（如广泛使用的OCRBench）主要集中于验证其短文本响应及简单版式长文本响应的正确性，而对复杂版式设计的长文本理解能力的评估虽至关重要却长期被忽视。本文提出菜单OCR与翻译基准（MOTBench），该专项评估框架强调菜单翻译在跨文化交流中的关键作用。MOTBench要求LVLMs准确识别并翻译菜单中的每道菜品及其价格、计量单位条目，从而全面评估其视觉理解与语言处理能力。我们的基准由一组中英文菜单构成，这些菜单具有复杂的版式设计、多样化的字体风格以及跨语言的文化特定元素，并附有精确的人工标注。实验表明，我们的自动评估结果与专业人工评估高度一致。我们评估了一系列公开的最先进LVLMs，通过分析其输出结果识别性能优劣，为LVLM的未来发展提供了有价值的指导。MOTBench可通过https://github.com/gitwzl/MOTBench获取。

OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents

Abstract

arXiv:2504.16918v2 Announce Type: replace-cross Abstract: Optimization plays a vital role in scientific research and practical applications. However, formulating a concrete optimization problem described in natural language into a mathematical form and selecting a suitable solver to solve the problem requires substantial domain expertise. We introduce OptimAI, a framework for solving Optimization problems described in natural language by leveraging LLM-powered AI agents, and achieve superior performance over current state-of-the-art methods. Our framework is built upon the following key roles: (1) a formulator that translates natural language problem descriptions into precise mathematical formulations; (2) a planner that constructs a high-level solution strategy prior to execution; and (3) a coder and a code critic capable of interacting with the environment and reflecting on outcomes to refine future actions. Ablation studies confirm that all roles are essential; removing the planner or code critic results in $5.8\times$ and $3.1\times$ drops in productivity, respectively. Furthermore, we introduce UCB-based debug scheduling to dynamically switch between alternative plans, yielding an additional $3.3\times$ productivity gain. Our design emphasizes multi-agent collaboration, and our experiments confirm that combining diverse models leads to performance gains. Our approach attains 88.1% accuracy on the NLP4LP dataset and 82.3% on the Optibench dataset, reducing error rates by 58% and 52%, respectively, over prior best results.

摘要

优化在科学研究和实际应用中发挥着至关重要的作用。然而，将自然语言描述的优化问题转化为数学形式并选择合适的求解器需要深厚的领域专业知识。我们提出了OptimAI框架，通过利用基于大语言模型的智能代理来解决自然语言描述的优化问题，其性能优于当前最先进方法。该框架基于以下核心角色构建：(1) 问题表述器——将自然语言描述转化为精确的数学表述；(2) 规划器——在执行前构建高层解决方案策略；(3) 编码器与代码评审器——能够与环境交互并通过结果反思来优化后续操作。消融实验证实所有角色都不可或缺：移除规划器或代码评审器分别会导致效率下降5.8倍和3.1倍。此外，我们提出基于UCB的调试调度机制来动态切换备选方案，实现了3.3倍的额外效率提升。我们的设计强调多智能体协作，实验证明组合不同模型能带来性能提升。该方法在NLP4LP数据集上达到88.1%的准确率，在Optibench数据集上达到82.3%的准确率，相较之前最佳结果分别降低了58%和52%的错误率。
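
The abstract does not spell out how the UCB-based debug scheduling is computed. As an illustration of the general idea only, the sketch below applies the textbook UCB1 rule to choose which alternative plan to debug next. The function name `ucb1_pick`, the exploration constant `c`, and the reward bookkeeping are assumptions made for this example, not OptimAI's actual scheduler.

```python
import math
from typing import List


def ucb1_pick(rewards: List[float], pulls: List[int], c: float = 1.0) -> int:
    """Return the index of the plan with the highest UCB1 score."""
    # Try every plan once before comparing scores.
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    total = sum(pulls)
    scores = [
        r / n + c * math.sqrt(2.0 * math.log(total) / n)  # mean reward + exploration bonus
        for r, n in zip(rewards, pulls)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Plan 0 has been debugged 100 times with mean reward 0.5; plan 1 only once, with reward 0.
choice = ucb1_pick([50.0, 0.0], [100, 1])
print(choice)
```

The rarely tried plan wins here because its exploration bonus outweighs its low observed reward, which is the mechanism that lets a scheduler move away from a plan stuck in repeated debugging.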

VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning

Abstract

arXiv:2504.19627v2 Announce Type: replace-cross Abstract: Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire images at the token level, which is inefficient compared to humans who analyze information and generate content at the conceptual level, extracting relevant visual concepts with minimal effort. This inefficiency, stemming from the lack of a visual concept model, limits LVLMs' usability in real-world applications. To address this, we propose VCM, an end-to-end self-supervised visual concept modeling framework. VCM leverages implicit contrastive learning across multiple sampled instances and vision-language fine-tuning to construct a visual concept model without requiring costly concept-level annotations. Our results show that VCM significantly reduces computational costs (e.g., 85% fewer FLOPs for LLaVA-1.5-7B) while maintaining strong performance across diverse image understanding tasks. Moreover, VCM enhances visual encoders' capabilities in classic visual concept perception tasks. Extensive quantitative and qualitative experiments validate the effectiveness and efficiency of VCM.

摘要

大型视觉语言模型（LVLMs）因其强大的视觉语言推理能力，在具身智能等现实世界人工智能任务中具有关键作用。然而，当前LVLMs在令牌级别处理整幅图像的方式效率低下，与人类在概念层面分析信息并生成内容、以最小努力提取相关视觉概念的方式形成鲜明对比。这种因缺乏视觉概念模型导致的低效性，限制了LVLMs在实际应用中的可用性。为此，我们提出VCM——一种端到端自监督视觉概念建模框架。VCM通过跨多采样实例的隐式对比学习和视觉语言微调，无需昂贵的概念级标注即可构建视觉概念模型。实验结果表明，VCM在保持多样化图像理解任务性能的同时显著降低计算成本（例如LLaVA-1.5-7B的FLOPs减少85%）。此外，VCM增强了视觉编码器在经典视觉概念感知任务中的能力。大量定量与定性实验验证了VCM的有效性和高效性。

BrainPrompt: Multi-Level Brain Prompt Enhancement for Neurological Condition Identification

Abstract

arXiv:2504.16096v2 Announce Type: replace-cross Abstract: Neurological conditions, such as Alzheimer's Disease, are challenging to diagnose, particularly in the early stages where symptoms closely resemble healthy controls. Existing brain network analysis methods primarily focus on graph-based models that rely solely on imaging data, which may overlook important non-imaging factors and limit the model's predictive power and interpretability. In this paper, we present BrainPrompt, an innovative framework that enhances Graph Neural Networks (GNNs) by integrating Large Language Models (LLMs) with knowledge-driven prompts, enabling more effective capture of complex, non-imaging information and external knowledge for neurological disease identification. BrainPrompt integrates three types of knowledge-driven prompts: (1) ROI-level prompts to encode the identity and function of each brain region, (2) subject-level prompts that incorporate demographic information, and (3) disease-level prompts to capture the temporal progression of disease. By leveraging these multi-level prompts, BrainPrompt effectively harnesses knowledge-enhanced multi-modal information from LLMs, enhancing the model's capability to predict neurological disease stages and meanwhile offers more interpretable results. We evaluate BrainPrompt on two resting-state functional Magnetic Resonance Imaging (fMRI) datasets from neurological disorders, showing its superiority over state-of-the-art methods. Additionally, a biomarker study demonstrates the framework's ability to extract valuable and interpretable information aligned with domain knowledge in neuroscience. The code is available at https://github.com/AngusMonroe/BrainPrompt

摘要

阿尔茨海默病等神经系统疾病的诊断具有挑战性，尤其在早期阶段，其症状与健康对照组极为相似。现有的脑网络分析方法主要集中于仅依赖影像数据的基于图模型的方法，这可能忽略重要的非影像因素，并限制模型的预测能力和可解释性。本文提出BrainPrompt，一种创新框架，通过将大型语言模型（LLMs）与知识驱动的提示相结合来增强图神经网络（GNNs），从而更有效地捕捉复杂的非影像信息和外部知识，以用于神经系统疾病的识别。BrainPrompt整合了三种知识驱动的提示：(1) ROI级提示，用于编码每个脑区的身份和功能；(2) 受试者级提示，用于纳入人口统计信息；(3) 疾病级提示，用于捕捉疾病的时间进展。通过利用这些多级提示，BrainPrompt有效地利用了来自LLMs的知识增强多模态信息，提升了模型预测神经系统疾病阶段的能力，同时提供了更具可解释性的结果。我们在两个来自神经系统疾病的静息态功能磁共振成像（fMRI）数据集上评估了BrainPrompt，结果显示其优于现有最先进方法。此外，一项生物标志物研究表明，该框架能够提取与神经科学领域知识一致的有价值且可解释的信息。代码可在https://github.com/AngusMonroe/BrainPrompt获取。

Dynamic Early Exit in Reasoning Models

Abstract

arXiv:2504.15895v2 Announce Type: replace-cross Abstract: Recent advances in large reasoning language models (LRLMs) rely on test-time scaling, which extends long chain-of-thought (CoT) generation to solve complex tasks. However, overthinking in long CoT not only slows down the efficiency of problem solving, but also risks accuracy loss due to the extremely detailed or redundant reasoning steps. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by early exit during generation. Instead of relying on fixed heuristics, the proposed method monitors model behavior at potential reasoning transition points (e.g., "Wait" tokens) and dynamically terminates the next reasoning chain's generation when the model exhibits high confidence in a trial answer. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs. Experiments on 10 reasoning benchmarks (e.g., GSM8K, MATH-500, AMC, GPQA, AIME and LiveCodeBench) show that the proposed method is consistently effective on 11 cutting-edge reasoning LLMs of varying series and sizes, reducing the length of CoT sequences by an average of 19.1% to 80.1% while improving accuracy by 0.3% to 5.0%.

摘要

大规模推理语言模型（LRLMs）的最新进展依赖于测试时扩展，通过生成长链思维（CoT）来解决复杂任务。然而，过长的CoT不仅会降低问题求解效率，还可能因推理步骤过于详细或冗余而导致准确性下降。我们提出了一种简单而有效的方法，使LLMs能够在生成过程中通过提前退出来自我截断CoT序列。该方法不依赖固定启发式规则，而是在潜在推理转换点（如"Wait"标记）监测模型行为，当模型对试验答案表现出高置信度时，动态终止后续推理链的生成。本方法无需额外训练，可无缝集成到现有类o1推理LLMs中。在10个推理基准测试（如GSM8K、MATH-500、AMC、GPQA、AIME和LiveCodeBench）上的实验表明，该方法对11个不同系列和规模的尖端推理LLMs均具有持续有效性，平均将CoT序列长度缩短19.1%至80.1%，同时将准确率提高0.3%至5.0%。
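
The early-exit rule in the abstract can be illustrated with a small sketch: at each reasoning transition point, estimate the confidence in a trial answer and stop if it is high enough. This is a simplified stand-in rather than the paper's implementation. The pre-split `chunks` list replaces live generation, and `trial_answer_confidence`, `threshold`, and `transition_marker` are hypothetical names.

```python
from typing import Callable, List


def generate_with_early_exit(
    chunks: List[str],
    trial_answer_confidence: Callable[[List[str]], float],
    threshold: float = 0.9,
    transition_marker: str = "Wait",
) -> List[str]:
    """Keep reasoning chunks until confidence at a transition point is high."""
    kept: List[str] = []
    for chunk in chunks:
        # A chunk opening with the marker signals a potential transition point.
        if kept and chunk.startswith(transition_marker):
            if trial_answer_confidence(kept) >= threshold:
                break  # confident enough: skip the remaining reasoning
        kept.append(chunk)
    return kept


reasoning = ["2 + 2 = 4.", "Wait, let me double-check.", "Yes, the answer is 4."]
short = generate_with_early_exit(reasoning, lambda kept: 0.95)
full = generate_with_early_exit(reasoning, lambda kept: 0.50)
print(len(short), len(full))
```

Because the check runs only at transition points and needs no parameter updates, a rule of this shape can wrap an existing model without additional training, as the abstract states for the proposed method.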

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

Abstract

arXiv:2504.15585v2 Announce Type: replace-cross Abstract: The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concern, not only for researchers and corporations but also for every nation. Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., deployment phase or fine-tuning phase, lacking a comprehensive understanding of the entire "lifechain" of LLMs. To address this gap, this paper introduces, for the first time, the concept of "full-stack" safety to systematically consider safety issues throughout the entire process of LLM training, deployment, and eventual commercialization. Compared to the off-the-shelf LLM safety surveys, our work demonstrates several distinctive advantages: (I) Comprehensive Perspective. We define the complete LLM lifecycle as encompassing data preparation, pre-training, post-training, deployment and final commercialization. To our knowledge, this represents the first safety survey to encompass the entire lifecycle of LLMs. (II) Extensive Literature Support. Our research is grounded in an exhaustive review of more than 800 papers, ensuring comprehensive coverage and systematic organization of security issues within a more holistic understanding. (III) Unique Insights. Through systematic literature analysis, we have developed reliable roadmaps and perspectives for each chapter. Our work identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems. These insights provide valuable guidance for researchers pursuing future work in this field.

摘要

大型语言模型（LLM）的显著成功为学术界和工业界实现通用人工智能指明了一条充满希望的道路，这得益于其在各类应用中展现出的前所未有的性能。随着LLM在研究和商业领域的影响力持续扩大，其安全性和潜在风险已引起研究者、企业乃至各国的高度关注。当前已有的LLM安全性综述主要聚焦于模型生命周期的特定阶段（如部署阶段或微调阶段），缺乏对LLM完整"生命链"的系统性认知。为填补这一空白，本文首次提出"全栈安全"概念，旨在系统考量LLM从训练、部署到最终商业化的全流程安全问题。相较于现有LLM安全综述，本研究展现出以下显著优势：（I）全景视角。我们将完整LLM生命周期定义为包含数据准备、预训练、训练后处理、部署及最终商业化五个阶段。据我们所知，这是首个涵盖LLM全生命周期的安全综述。（II）广泛文献支撑。本研究基于对800余篇文献的详尽梳理，确保在更全局的认知框架下实现安全问题的全面覆盖与系统化组织。（III）独到见解。通过系统性文献分析，我们为每个章节构建了可靠的研究路线图与观点体系。研究发现数据生成安全、对齐技术、模型编辑以及基于LLM的智能体系统等方向具有重要研究价值，这些洞见可为该领域未来研究提供有价值的指引。

Process Reward Models That Think

Abstract

arXiv:2504.16828v2 Announce Type: replace-cross Abstract: Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at https://github.com/mukhal/thinkprm.

摘要

逐步验证器（亦称过程奖励模型PRM）是测试时规模扩展的关键组件。传统PRM需要步骤级监督信号，导致训练成本高昂。本研究致力于构建数据高效的语言化分步奖励模型，通过生成验证思维链（CoT）对解题每一步进行核查。我们提出ThinkPRM——一种长思维链验证器，其微调所需的过程标注量比判别式PRM减少数个数量级。该方法充分发挥长思维链模型固有的推理能力，在仅使用PRM800K数据集1%过程标注的条件下，于多个高难度基准测试（包括ProcessBench、MATH-500和AIME '24）中超越LLM-as-a-Judge和判别式验证器，无论采用N选优策略还是奖励引导搜索。在GPQA-Diamond子集和LiveCodeBench的跨域评估中，我们的PRM模型分别以8%和4.5%的优势战胜了使用完整PRM800K训练的判别式验证器。在相同token预算下，ThinkPRM比LLM-as-a-Judge更高效地扩展验证计算量，在ProcessBench子集上领先7.2%。本研究证明：生成式长思维链PRM能以极低的监督成本扩展验证时的计算规模。代码、数据及模型将于https://github.com/mukhal/thinkprm发布。
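
Best-of-N selection with a step-level verifier, one of the evaluation settings named in the abstract, can be sketched as below. The aggregation rule (scoring a solution by its weakest step) and the toy verifier are assumptions chosen for illustration; the abstract does not specify how ThinkPRM turns its verification chain-of-thought into a solution score.

```python
from typing import Callable, List

Solution = List[str]  # a solution is a list of reasoning steps


def solution_score(step_scores: List[float]) -> float:
    """Score a solution by its weakest step (an assumed aggregation rule)."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(
    solutions: List[Solution],
    step_verifier: Callable[[Solution], List[float]],
) -> Solution:
    """Return the candidate whose aggregated step scores are highest."""
    if not solutions:
        raise ValueError("need at least one candidate solution")
    return max(solutions, key=lambda s: solution_score(step_verifier(s)))


# Toy verifier (assumption): steps containing "error" get a low correctness score.
def toy_verifier(solution: Solution) -> List[float]:
    return [0.1 if "error" in step else 0.9 for step in solution]


candidates = [["x = 3", "error: 3 * 2 = 5"], ["x = 3", "3 * 2 = 6"]]
chosen = best_of_n(candidates, toy_verifier)
print(chosen)
```

Taking the minimum means a single flawed step sinks the whole candidate, which reflects the purpose of step-by-step verification: a final answer is only as reliable as the weakest step behind it.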

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

Abstract

arXiv:2504.19162v2 Announce Type: replace-cross Abstract: Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging due to the difficulty and cost of obtaining high-quality step-level supervision. In this paper, we introduce Self-Play Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" that deliberately produces erroneous steps designed to be difficult to detect, and a "critic" that analyzes the correctness of reasoning steps. These two models engage in an adversarial game in which the generator aims to fool the critic, while the critic model seeks to identify the generator's errors. Using reinforcement learning based on the game outcomes, the models iteratively improve; the winner of each confrontation receives a positive reward and the loser receives a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) demonstrate that our SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including a distilled R1 model. Furthermore, SPC can guide the test-time search of diverse LLMs and significantly improve their mathematical reasoning performance on MATH500 and AIME2024, surpassing those guided by state-of-the-art process reward models.

摘要

评估大型语言模型（LLM）逐步推理（如思维链）的可靠性仍具挑战性，这主要源于获取高质量步骤级监督数据的难度与成本。本文提出自我博弈批评器（SPC），该方法通过对抗性自我博弈游戏使批评模型逐步提升推理步骤评估能力，无需人工步骤标注。SPC通过微调基础模型的两个副本分别扮演两种角色："狡诈生成器"刻意生成难以检测的错误推理步骤，而"批评器"则负责分析步骤正确性。二者进行对抗博弈：生成器试图欺骗批评器，批评器则努力识别错误。基于博弈结果的强化学习驱动模型迭代优化——每轮对抗的胜者获得正奖励，败者获得负奖励，从而实现持续自我进化。在三个推理过程基准测试（ProcessBench、PRM800K、DeltaBench）上的实验表明，SPC能持续提升错误检测能力（如ProcessBench准确率从70.8%提升至77.7%），并超越包括蒸馏R1模型在内的强基线。此外，SPC能引导不同LLM的测试时搜索，显著提升其在MATH500和AIME2024上的数学推理性能，其效果优于当前最先进的流程奖励模型引导的结果。

Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models

Abstract

arXiv:2505.00979v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have achieved remarkable success but remain data-inefficient, especially when learning from small, specialized corpora with limited and proprietary data. Existing synthetic data generation methods for continue pre-training focus on intra-document content and overlook cross-document knowledge associations, limiting content diversity and depth. We propose Synthetic-on-Graph (SoG), a synthetic data generation framework that incorporates cross-document knowledge associations for efficient corpus expansion. SoG constructs a context graph by extracting entities and concepts from the original corpus, representing cross-document associations, and employing a graph walk strategy for knowledge-associated sampling. This enhances synthetic data diversity and coherence, enabling models to learn complex knowledge structures and handle rare knowledge. To further improve synthetic data quality, we integrate Chain-of-Thought (CoT) and Contrastive Clarifying (CC) synthetic, enhancing reasoning processes and discriminative power. Experiments show that SoG outperforms the state-of-the-art (SOTA) method in a multi-hop document Q&A dataset while performing comparably to the SOTA method on the reading comprehension task datasets, which also underscores the better generalization capability of SoG. Our work advances synthetic data generation and provides practical solutions for efficient knowledge acquisition in LLMs, especially in domains with limited data availability.

摘要

大型语言模型（LLMs）已取得显著成功，但在数据利用效率方面仍存在不足，尤其当面对数据有限且专有的小型专业语料库进行持续预训练时。现有合成数据生成方法主要关注文档内部内容，忽视了跨文档知识关联，导致生成内容多样性和深度受限。本文提出基于图结构的合成数据生成框架SoG（Synthetic-on-Graph），通过融入跨文档知识关联实现高效语料扩展。SoG首先从原始语料中提取实体与概念构建上下文图以表征跨文档关联，随后采用图游走策略进行知识关联采样。该方法显著提升合成数据的多样性与连贯性，使模型能够学习复杂知识结构并处理罕见知识。为进一步提高数据质量，我们整合思维链（CoT）与对比澄清（CC）合成技术，强化推理过程与判别能力。实验表明，SoG在多跳文档问答数据集上优于当前最优方法，在阅读理解任务数据集上表现相当，同时凸显了更优的泛化能力。本研究推动了合成数据生成技术的发展，并为LLMs在数据稀缺领域的高效知识获取提供了实用解决方案。

ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant

Abstract

arXiv:2505.03654v2 Announce Type: replace-cross Abstract: Recent advances in personalized MLLMs enable effective capture of user-specific concepts, supporting both recognition of personalized concepts and contextual captioning. However, humans typically explore and reason over relations among objects and individuals, transcending surface-level information to achieve more personalized and contextual understanding. To this end, existing methods may face three main limitations: Their training data lacks multi-object sets in which relations among objects are learnable. Building on the limited training data, their models overlook the relations between different personalized concepts and fail to reason over them. Their experiments mainly focus on a single personalized concept, where evaluations are limited to recognition and captioning tasks. To address the limitations, we present a new dataset named ReGraP, consisting of 120 sets of personalized knowledge. Each set includes images, KGs, and CoT QA pairs derived from the KGs, enabling more structured and sophisticated reasoning pathways. We propose ReGraP-LLaVA, an MLLM trained with the corresponding KGs and CoT QA pairs, where soft and hard graph prompting methods are designed to align KGs within the model's semantic space. We establish the ReGraP Benchmark, which contains diverse task types: multiple-choice, fill-in-the-blank, True/False, and descriptive questions in both open- and closed-ended settings. The proposed benchmark is designed to evaluate the relational reasoning and knowledge-connection capability of personalized MLLMs. We conduct experiments on the proposed ReGraP-LLaVA and other competitive MLLMs. Results show that the proposed model not only learns personalized knowledge but also performs relational reasoning in responses, achieving the SoTA performance compared with the competitive methods. All the codes and datasets are released at: https://github.com/xyfyyds/ReGraP.

摘要

个性化多模态大语言模型（MLLMs）的最新进展能够有效捕捉用户特定概念，同时支持个性化概念识别和上下文描述。然而，人类通常会对物体与个体间的关系进行探索和推理，超越表层信息以实现更具个性化和上下文关联的理解。现有方法可能面临三个主要局限：其训练数据缺乏可学习物体间关系的多对象集合；基于有限训练数据，模型忽视了个性化概念间的关系且未能进行推理；实验主要集中于单一个性化概念，评估仅限于识别和描述任务。为此，我们提出新数据集ReGraP，包含120组个性化知识集合，每组涵盖图像、知识图谱（KGs）及基于KGs生成的思维链问答对（CoT QA），以支持更结构化、复杂的推理路径。我们提出ReGraP-LLaVA模型，通过KGs和CoT QA对进行训练，并设计软硬图提示方法将KGs对齐至模型语义空间。建立ReGraP基准测试，包含多选题、填空题、判断题及开放/封闭式描述题等多样化任务类型，用于评估个性化MLLMs的关系推理与知识联结能力。在ReGraP-LLaVA及竞争性MLLMs上的实验表明，该模型不仅能学习个性化知识，还能在响应中执行关系推理，相较竞争方法达到最先进性能。所有代码与数据集发布于：https://github.com/xyfyyds/ReGraP。

